
Rare Sequential Pattern Mining of

Critical Infrastructure Control Logs for

Anomaly Detection

by

Anisur Rahman

Bachelor of Science in Computer Science and Engineering (The University of Asia Pacific), August 2002

Master of Science in Computer Science and Engineering (Daffodil International University), December 2009

Submitted in fulfilment of the requirement for the degree

of Doctor of Philosophy

Information Security Discipline

Science and Engineering Faculty

Queensland University of Technology

2019

Keywords

Frequent Pattern, Rare Pattern, Sequential Database, Critical Infrastructure, SCADA Control System, Anomaly Detection, Itemset Pattern Mining, Sequential Pattern Mining, Rare Sequential Pattern Mining, Sequential Association Rules Mining.

Abstract

The need to provide cybersecurity for Supervisory Control and Data Acquisition (SCADA) control systems is now recognised worldwide. These SCADA systems drive much of a nation's critical infrastructure, which by definition is essential to the way of life of the nation's citizens. SCADA control systems once operated in isolation, which provided a level of protection from anomalies and intrusions. They are now connected to computer networks and the Internet so that their operations can be controlled and monitored. This connection exposes SCADA systems to cyber-attacks. Therefore, there is a need for a detection system that discovers anomalies that may have occurred on a system.

Log files record the process activities of the SCADA control system. These logs can be analysed to detect abnormal process activities, which are treated as anomalies on the control system. Attacks or anomalies on a system may be frequent or rare, but this thesis is concerned only with rare anomalies. The main objective of this thesis is to design and develop a method that detects anomalies from SCADA control logs by using a rare sequential pattern mining technique. In addition, this thesis aims to develop a method for predicting possible anomalies on a SCADA control system. To achieve the main objective, anomalies are considered to be a rare phenomenon in a system, so we propose and develop a new rare sequential pattern mining approach to find rare or infrequent patterns. Since the goal of pattern mining is typically to find the regular behaviour of a system, rare behaviour is often explicitly ignored and discarded. Yet rare patterns can provide valuable information indicating anomalous or unacceptable behaviour of a system. To find effective rare patterns, the rare sequential patterns are divided into groups such that all patterns in a group share the same frequency, or support, value. The smallest pattern in each group is the minimal rare sequential pattern, while the largest pattern is the maximal rare sequential pattern. We evaluated the rare sequential pattern mining method using SCADA control system log data containing cyber incidents. The rare anomalous sequences identified were attacks on the system, demonstrating the usefulness of the rare sequential pattern mining approach.

Next, we used constraints to improve the effectiveness and efficiency of the proposed rare sequential pattern mining algorithm. The constraints were used to generate only those rare patterns that are useful for detecting anomalies on a SCADA system. Efficiency improved because the constrained rare sequential pattern mining algorithm generated fewer patterns, which reduced the computational time. These gains in effectiveness and efficiency did not compromise anomaly detection accuracy, so security operators need less effort and time to detect anomalies.

Finally, we developed a sequential association rule mining approach to predict possible anomalies on SCADA control systems. In this method, we used rare sequential patterns to generate association rules. These rules were then used to predict possible anomalies from the streaming logs of the SCADA control system. The results of this thesis demonstrate that anomalies can be detected from SCADA control logs by applying our rare sequential pattern mining approach, and that anomalies on SCADA systems can be predicted by using sequential association rules.

Contents

Keywords
Abstract
Table of Contents
List of Figures
List of Tables
Declaration
Previously Published Material
Acknowledgements

Chapter 1 Introduction
  1.1 Background and Motivation
  1.2 Research Problem
  1.3 Research Aims and Scope
  1.4 Research Contributions
  1.5 Structure of the Thesis

Chapter 2 Background and Literature Review
  2.1 Background of SCADA Control System
  2.2 SCADA Test-bed Scenario
  2.3 Anomalies in SCADA Control System
  2.4 Anomalies Detection Methods
    2.4.1 Signature-based Detection
    2.4.2 Anomaly-based Detection
  2.5 Data Mining and Machine Learning
    2.5.1 Data Mining
    2.5.2 Machine Learning Methods
    2.5.3 Supervised Learning Method
    2.5.4 Unsupervised Learning Method
    2.5.5 Semi-supervised Learning Method
  2.6 Pattern Mining
    2.6.1 Itemset Pattern Mining
    2.6.2 Sequential Pattern Mining
  2.7 Constraint-based Pattern Mining
  2.8 Association Rule Mining
  2.9 Existing Anomaly Detection in SCADA System
  2.10 Summary and Research Gaps

Chapter 3 A Rare Sequential Pattern Mining Approach for Anomaly Detection
  3.1 Introduction
  3.2 Definitions
  3.3 A New Method For Finding Rare Sequential Patterns
    3.3.1 Generating Rare Sequential Generator Patterns
    3.3.2 Generating All Rare Sequential Patterns
  3.4 Evaluation
    3.4.1 SCADA System Architecture
    3.4.2 Datasets
    3.4.3 Experimental Methodology
    3.4.4 Results
  3.5 Discussion and Analysis
    3.5.1 Effectiveness of Equivalence Class
    3.5.2 Computational Complexity
  3.6 Related Work
  3.7 Summary

Chapter 4 Constraint-based Rare Sequential Pattern Mining
  4.1 Introduction
    4.1.1 Motivation
  4.2 Existing Related Work
  4.3 Preliminaries
  4.4 Constraint-based Rare Sequential Pattern Mining Algorithm
    4.4.1 Generating Constrained Rare Sequential Generator Patterns
    4.4.2 Generating Constrained Rare Sequential Patterns
  4.5 Experimental Evaluation
    4.5.1 Dataset
    4.5.2 Pre-processing
    4.5.3 Experimental methodology
  4.6 Results and Analysis
    4.6.1 Conveyor-belt Control System
    4.6.2 Pressure Control System
    4.6.3 Water Tank Control System
  4.7 Discussion
  4.8 Conclusion

Chapter 5 A Rare Sequential Association Rules Mining of SCADA Streaming Logs for Anomaly Prediction
  5.1 Introduction
    5.1.1 Motivation
  5.2 Previous Work
  5.3 Preliminaries
  5.4 A New Anomaly Prediction Method Using Sequential Association Rules
    5.4.1 Generating Sequential Association Rules
    5.4.2 Prediction of Anomalies using Sequential Association Rules
  5.5 Experimental Evaluation
    5.5.1 Dataset
    5.5.2 Pre-processing
    5.5.3 Experimental Methodology
  5.6 Results and Analysis
    5.6.1 Conveyor-belt Control System
    5.6.2 Pressure Control System
    5.6.3 Water Tank Control System
  5.7 Discussion
  5.8 Summary

Chapter 6 Conclusion and Future Work
  6.1 Research Summary
  6.2 Future Research Directions

Bibliography

List of Figures

2.1 A simplistic view of SCADA control system layout.
2.2 A physical laboratory view of the SCADA test-bed.
2.3 A logical view of SCADA test-bed process control system.
2.4 A data mining approach for information extraction.
2.5 Supervised learning method.
2.6 Unsupervised learning method.
2.7 Semi-supervised learning method.
2.8 A sequence diagram.
3.1 A partial lattice view of a sequential database.
3.2 The positive and the negative border of a lattice of a sequential database.
3.3 An equivalence class of rare sequential patterns.
5.1 Anomaly prediction from streaming logs.

List of Tables

2.1 Transaction database TDB.
2.2 A sequential database SDB.
2.3 A market basket transaction database.
3.1 A sequential database SDB.
3.2 Execution of Algorithm 3.1.
3.3 A partial view of a conveyor belt control system log.
3.4 A partial view of a pressure control system log.
3.5 A partial view of a water tank control system log.
3.6 A partial view of a conveyor belt control system logs from the second dataset.
3.7 A partial view of a pressure control system log from the second dataset.
3.8 A partial view of a water tank control system log from the second dataset.
3.9 A sample of the conveyor belt SDB generated from Dataset-1 in the First Dataset.
3.10 A sample of the pressure control SDB generated from Dataset-2 in the First Dataset.
3.11 A sample of the water tank SDB generated from Dataset-3 in the First Dataset.
3.12 A sample of the rare sequential patterns from conveyor belt SDB in Dataset-1.
3.13 A sample of rare sequential patterns from pressure control SDB in Dataset-2.
3.14 A sample of rare sequential patterns from water tank SDB in Dataset-3.
3.15 A sample of rare sequential patterns from conveyor belt SDB in Dataset-4.
3.16 A sample of rare sequential patterns from pressure control SDB in Dataset-5.
3.17 Comparison among the databases regarding the number of frequent generators our algorithm produced and the number of frequent generators produced by FEAT algorithm.
4.1 A sequential database SDB with events' occurrence time-stamp.
4.2 A partial view of a conveyor belt control logs.
4.3 A partial view of a pressure control logs.
4.4 A partial view of a water tank control logs.
4.5 Confusion matrix.
4.6 A partial view of the conveyor-belt result from the Experiment-1.
4.7 A partial view of the conveyor-belt result from Experiment-2.
4.8 A comparison table showing the number of rare sequential patterns and the computational time taken by the four experiments on the conveyor-belt database.
4.9 A partial view of the pressure control result from Experiment-1.
4.10 A partial view of the pressure control result from Experiment-2.
4.11 A comparison table showing number of rare patterns and time taken by all 4 experiments on pressure control SDB.
4.12 A partial view of the water tank result from Experiment-1.
4.13 A partial view of the water tank result from Experiment-2.
4.14 A comparison table showing number of rare patterns and computational time taken by the four experiments on water tank control system database.
5.1 A sequential database SDB.
5.2 A view of possible rare sequential association rules.
5.3 Examples of sequential association rules from three control systems.
5.4 Anomaly predictions from the three control system streaming logs.




Previously Published Material

The following articles have been published, and contain material based on the content of this thesis.

(i) Anisur Rahman, Yue Xu, Kenneth Radke and Ernest Foo. Finding Anomalies in SCADA Logs Using Rare Sequential Pattern Mining. In 10th International Conference on Network and System Security, Springer; pages 499-506, September 28-30, 2016, Taipei, Taiwan.

(ii) Anisur Rahman, Yue Xu, Kenneth Radke and Ernest Foo. A Rare Sequential Pattern Mining Approach For Anomaly Detection. In the Journal of Knowledge and Information Security. (Review Submitted)

Acknowledgements

First of all, I express my humble and sincere gratitude to Almighty Allah (SWT), Who bestowed upon me the knowledge, wisdom, health, and mental strength to undertake this Ph.D. research and enabled me to complete it. Next, I am greatly indebted to my wonderful supervisory team: Dr. Ernest Foo, my principal supervisor, and Associate Professor Yue Xu and Dr. Kenneth Radke, my two associate supervisors. They spent a great deal of time guiding me in our regular weekly meetings, where we discussed my research updates. Their consistent guidance, encouragement and constructive feedback on my research and writing helped me to achieve my research goals. Thanks to Ernest once again for bringing information security and data science together in my research, which helped me to expand my knowledge domain. I am equally thankful to Professor Yue for guiding me in learning the data science theory, tools and techniques needed for my research. Also, thanks to Kenneth for all the tips while implementing the algorithms.

I would like to thank Professor Yuefeng Li and Dr. Matthew McKague for serving as panel members from outside my supervisory team at my Ph.D. final seminar. I am grateful that they took the time to read my thesis and give their valuable comments. Also, thanks to the external examiners for spending their time reading my thesis and providing constructive suggestions to improve its quality. In addition, I am thankful to the Queensland University of Technology (QUT) for awarding me the QUT Postgraduate Research Award (QUTPRA), the QUT Higher Degree Research Tuition Fee Sponsorship, and Conference Travel Support. It would have been impossible to start my Ph.D. journey without the scholarship support. Furthermore, I would like to thank the unit coordinators, Dr. Ernest Foo, Associate Professor Yue Xu, Dr. Leonie Simpson, Dr. Matthew McKague, Dr. Wasana Bandara and Prof. Yanming Feng, who showed confidence in me and gave me the opportunity to work as a sessional academic at QUT. This helped me to gain academic teaching and learning experience in addition to providing me with financial support.

I would like to thank my colleagues and friends in the Information Security Discipline at QUT. With them I had the opportunity to learn to work both independently and collaboratively, and to develop my communication skills, during the course of my Ph.D. candidature. As a token of my appreciation, I mention their names: Nicholas Rodofile, Hassan Fareed M Lahza, Iftekhar Salam, David Myers, Jack Parry, Hassan Musallam Ahmed Qahur Al Mahri, Basker Palaniswamy, Udyani Shanika Kumari Herath Mudiyanselage, Mir Ali Rezazadeh Baee, Shriparen Sriskandarajah, Mukhtar Hussain, Chathurika Pavithrani Kumari, Niluka Amarasinghe, Tarun Bansal, Qinyi Li, James A. Akande and Chris Djamaludin. Moreover, I extend my thanks to those who shared my office, GP S1051: Shane Black, Vikal Achrya, Udyani and Fida. I also thank those whom I used to meet in the lobby of my office for their smiling faces and their interest in my research.

Finally, I would like to express my gratitude to my parents for their struggle and dedication to ensuring that their children became educated and good human beings. May Allah (SWT) bestow His blessings on them and give continuous reward to both of them. In honour of their struggle and dedication, I dedicate this thesis to my parents. Last but not least, my heartfelt thanks go to my wife for her love and belief in me; to my children, who are gifts from Allah (SWT); to my younger brothers, whom I love dearly; and to my mother-in-law, sisters-in-law, brother-in-law, uncles, aunties and all other family members for their love, affection, patience, tolerance and selfless support throughout my Ph.D. research. There are many relatives, friends and well-wishers, too many to mention here by name, to whom I am greatly thankful for their constant affection, encouragement and support.

Dedicated to my parents


Chapter 1

Introduction

Supervisory Control and Data Acquisition (SCADA) control system networks are widely used in national critical infrastructure such as nuclear power plants. Any attack on, or malfunction of, this critical infrastructure can have serious consequences for people, the environment and the industries linked to it. Critical infrastructure networks therefore need to be defended from unwanted incidents. However, defending them is not always possible, because the protection that SCADA systems traditionally relied on no longer holds. Traditional SCADA control systems worked in isolation, meaning they were not connected to the internet. These systems used proprietary software and hardware, which kept them more secure because it was hard for outsiders to learn how the vendor-specific software and hardware operated. Operational information was thus obscured from anyone outside the system. In other words, the system provided security through obscurity.

However, modern SCADA systems are connected to the internet, which exposes them to external networks. In addition, they now use off-the-shelf hardware and software. As a result, information about SCADA hardware and software is obtainable, which allows an attacker to conduct a cyber-attack on the system. For example, the most prominent attack on a SCADA system was the Stuxnet malware, which in 2010 hit a nuclear power plant in Iran, causing industrial damage [1].

Furthermore, as the configuration of a SCADA system is typically not changed frequently, its software is often not updated. This is because a SCADA system is designed to remain functional for at least 20 years. Its functionality is simple and limited to a small number of operations, which distinguishes it from conventional IT networks. Because of this simplicity, the configuration of the system is rarely updated, which leaves the system vulnerable to cyber-attack. Security incidents and attacks are expected to increase. As legacy systems, SCADA networks were not designed to work securely in an environment such as the internet. Therefore, defending SCADA systems from cyber-attack is difficult. Hence, a detection method is required to find anomalies or intrusions on the SCADA system. Once an unwanted incident is detected, reactive measures can be taken to reduce the consequences for the system. Detection is the primary motivation of this research.

The rest of this chapter is organized as follows: Section 1.1 presents the background and motivation of this research, Section 1.2 discusses the research problem, Section 1.3 defines the research aims and scope, Section 1.4 presents the contributions of this research, and finally Section 1.5 concludes the chapter by outlining the structure of this thesis.

1.1 Background and Motivation

Anomaly detection is one of several diverse measures that can be applied to protect critical infrastructure control system networks such as SCADA control systems. A control system consists of devices such as Remote Terminal Units (RTUs), Programmable Logic Controllers (PLCs), Intelligent Electronic Devices (IEDs) and various sensors that connect physical devices to computer networks for remotely monitoring and supervising critical infrastructure. Control systems are of different types depending on their application areas, for example, Process Control Systems (PCSs), Supervisory Control and Data Acquisition (SCADA) systems, Distributed Control Systems (DCSs), and Building Management Systems (BMSs) [2] [3].

The US government has defined transportation, oil and gas production and storage, water supply, emergency services, government services, banking and finance, electrical power, and telecommunications as critical infrastructures [4]. However, we argue that different countries may have different policies for declaring infrastructure to be critical. Critical infrastructures are closely interdependent. Therefore, if one of them is affected, the infrastructures that depend on it are affected in a cascading manner [5]. These critical infrastructures are so vital that their inactivity or destruction could affect human lives, the environment, and the economy.

SCADA systems are increasingly being used to monitor and control the process activities of critical infrastructures. Modern SCADA systems have come of age through several revolutionary steps, and now use conventional IT technology as a backbone to communicate with field devices. In the early stages of development, however, these systems were operated in isolation, meaning they were not connected to external networks. Moreover, each system was vendor-centric, meaning that it was operated only by software and hardware manufactured by a specific vendor. It was therefore difficult to attack the system, because little information about it was available. In other words, security was ensured by obscurity.

Since modern SCADA systems use off-the-shelf technology, they are prone to cyber-attacks. This is because adversaries can easily find tools and techniques for conducting successful attacks on the infrastructure [4]. Further, the sophistication of hacking tools is growing while the technical knowledge an intruder needs to cause harm is decreasing [6]. The impact of a cyber-attack on control systems varies, ranging from the destruction of infrastructure assets and the environment to the loss of human lives [7].

While most cyber-attacks on SCADA networks remain undisclosed or are classified by government agencies and industries, some prominent malicious attacks have been publicized in the literature. The first recorded cyber-attack, in 1982 on the Trans-Siberian gas pipeline in the Soviet Union, caused a huge explosion and a fire that was visible from space [8]. In 1998, a 12-year-old hacker managed to gain access to the Theodore Roosevelt Dam in Arizona by taking control of the computer that controls the floodgates of the dam. It was speculated that if the gates had been opened, the cities of Tempe and Mesa would have been flooded [9]. In 2003, the Davis-Besse nuclear plant in Ohio was disabled for several hours by the Slammer worm [10]. In 2010, the Stuxnet computer worm attacked a nuclear facility at Natanz in Iran, which caused centrifuges to fail.


Since SCADA systems use off-the-shelf technology, meaning commercially available hardware and software, such attack incidents are expected to grow in the years to come. Some attacks have no established motivation and yet can do catastrophic damage, while others are non-critical and cannot cause a catastrophic failure [11]. A Security Intelligence report by Scott Koegler [12] states that the cyber-threat has been increasing over time: the 2017 report records 90 million more intrusions than in 2016. Another report, published by IBM Managed Security Services (MSS) [13], shows that the number of industrial control system (ICS) attacks, defined as attacks that disrupt the process activities of critical infrastructure controlled by SCADA control networks, rose between 1 January 2013 and 30 August 2015.

As the number of attacks on ICS increases, the consequences of these attacks demand constant monitoring to detect anomalies or intrusions in the control system network. This research investigates SCADA control logs. Logs were chosen because SCADA process activities are recorded in log files. Any evidence of successful or attempted unauthorized access to, or intrusion into, the system may be recorded in log files, which can be analyzed to detect anomalies or intrusions. We assume that the integrity and availability of logs are ensured, meaning logs cannot be tampered with. Hadžiosmanović et al. [14] state that, compared to logs used in other domains, SCADA logs are in a format that is well suited to analysis. Garitano et al. [15] state that SCADA control systems have deterministic communication and activities that are, in most cases, limited and recurrent. Any activity that is not recurrent could be a rare pattern or phenomenon and hence an anomaly in the system. However, finding rare patterns from SCADA logs is challenging for the following reasons:

(i) There is a lack of existing knowledge about rare sequential pattern mining methods for finding rare suspicious patterns in SCADA control logs. To the best of our knowledge, there is no prior research on finding rare sequential patterns. Although a few works aim to find rare itemset patterns, these methods do not preserve the order of events. Preserving event order is important in SCADA control systems because events occur sequentially. However, keeping the order of events while generating rare patterns from SCADA control logs is difficult and costly, because the large number of event combinations increases both the computational time and the search space.

(ii) It is important to conduct a forensic investigation of the SCADA system to detect anomalies once they have occurred. However, because of the large volume of logs generated by the SCADA control system, it is difficult to find anomalies manually. Developing an efficient algorithm that automatically detects anomalies by analysing the logs is therefore important. However, no previous algorithm has addressed this problem using a rare sequential pattern mining approach.

(iii) It is also important to make an early prediction of a possible anomaly in a live SCADA control system, because early prediction may make it possible to avoid an attack on the system. However, predicting anomalies in a live SCADA system is difficult, as it requires analysing streamed logs as they are generated.
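The role of event order in reason (i) can be illustrated with a minimal Python sketch. It is not the method developed in this thesis; the event names (`valve_open`, `pump_on` and so on) and the helper `is_subsequence` are hypothetical and chosen only for illustration. The sketch shows two log fragments that are identical when viewed as itemsets but differ once order is preserved.

```python
def is_subsequence(pattern, sequence):
    """Return True if the events of `pattern` occur in `sequence`
    in the same order (gaps between events are allowed)."""
    remaining = iter(sequence)
    # `event in remaining` advances the iterator, so order is enforced.
    return all(event in remaining for event in pattern)


# Hypothetical event names, used only for illustration.
normal = ["valve_open", "pump_on", "pump_off", "valve_close"]
suspect = ["pump_on", "valve_open", "pump_off", "valve_close"]

# Viewed as itemsets, the two fragments cannot be told apart.
same_items = set(normal) == set(suspect)

# Viewed as sequences, the ordered pattern separates them.
pattern = ["valve_open", "pump_on"]
in_normal = is_subsequence(pattern, normal)
in_suspect = is_subsequence(pattern, suspect)

print(same_items, in_normal, in_suspect)  # True True False
```

An itemset-based method would treat both fragments as the same pattern, whereas an order-preserving check distinguishes them, at the cost of a larger space of candidate patterns.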

Therefore, this research will apply pattern mining techniques, a research branch

of data mining techniques, to extract hidden and useful information about the

system activities or behavior from the logs. Among the pattern mining
techniques, this research specifically uses sequential pattern mining, as it
operates on a sequential database to extract hidden and useful patterns. Frequent

patterns represent normal or expected behavior of a system. Sequences which

occur rarely in a system are called rare or infrequent sequential patterns. In this

research, it is assumed that anomalies or attacks happen very rarely in a system.

Therefore, this thesis introduces a rare sequential pattern mining technique to
find rare events that represent anomalous events or attacks on a system.
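To make the difference between itemset support and sequential support concrete, the following minimal Python sketch counts how often a pattern occurs in a small sequential database. It is an illustration only and not the algorithm proposed in this thesis: the event names, the toy database and the `min_sup` threshold of 0.5 are hypothetical, and a pattern is labelled rare when it occurs at least once but below `min_sup`.

```python
def is_subsequence(pattern, sequence):
    """True if the events of `pattern` occur in `sequence` in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # `event in remaining` advances the iterator, so the order is enforced.
    return all(event in remaining for event in pattern)


def sequence_support(pattern, database):
    """Fraction of sequences that contain `pattern` as an ordered subsequence."""
    return sum(is_subsequence(pattern, seq) for seq in database) / len(database)


def itemset_support(items, database):
    """Fraction of sequences that contain all `items`, ignoring their order."""
    wanted = set(items)
    return sum(wanted <= set(seq) for seq in database) / len(database)


def label(pattern, database, min_sup=0.5):
    """Label a pattern 'frequent', 'rare' or 'absent' (the threshold is an assumption)."""
    sup = sequence_support(pattern, database)
    if sup == 0:
        return "absent"
    return "frequent" if sup >= min_sup else "rare"


# Hypothetical toy database: each tuple is one ordered sequence of log events.
logs = [
    ("pump_on", "level_high", "pump_off"),
    ("pump_on", "level_high", "pump_off"),
    ("pump_on", "level_high", "pump_off"),
    ("pump_on", "pump_off", "level_high"),  # same events, unusual order
]

swapped = ("pump_off", "level_high")
print(itemset_support(swapped, logs))   # order ignored
print(sequence_support(swapped, logs))  # order preserved
print(label(swapped, logs))
```

In this toy database the two events appear together in every sequence, so an itemset view sees nothing unusual, whereas the ordered pattern occurs in only one of the four sequences and is therefore labelled rare.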

1.2 Research Problem

The SCADA control system network keeps records of process events or activities
in log files. Each individual event of the control system is recorded and
tagged with a timestamp. The timestamps indicate when the events

occur in the system. As a result, the recorded logs resemble a sequence of events

that represent the process activities of the SCADA control system. The logs

record the normal process activities as well as unexpected abnormal process
activities. The normal process activities are the expected outcome of


the control process, while the abnormal process activities include disruption in

the process activities resulting in an unexpected faulty outcome. The abnormal

process activities or anomalies could emanate from a natural failure or malfunc-

tion of the system. In addition, the abnormal process could be the result of a

cyber-attack on the system.

The analysis of SCADA control logs could help us to find anomalies in the
system. It is assumed that the anomalies this research addresses rarely happen
in a system in comparison to the regular activities of the system. Although many
attacks leave frequent records, we concentrate on rare events. Therefore, by
analysing the logs we can find rare activities of the SCADA control system. Since
the SCADA logs are large, it is difficult to analyse them manually to identify
anomalies. Data mining methods can be used to analyse these large logs
to extract useful information, that is, rare patterns to detect anomalies. There
has been much research in intrusion detection using signature-based detection.
Although this method can provide a high detection rate and few false alarms, it
cannot detect unknown or zero-day attacks. On the other hand,
there has been little work in anomaly-based or behaviour-based detection. This
method can detect not only known attacks, but also unknown zero-day
attacks. However, it generates a high false alarm rate.

The SCADA activities and topology usually do not change very frequently.

In other words, the actions and system's behaviour remain almost predictable

[16]. If any action deviates from the normal or expected behaviour of the system

then this action can be considered as an anomalous event and deserves further

investigation. Hadžiosmanović et al. [17] used water treatment SCADA logs to
detect anomalous events by applying a data mining approach. They used itemset
pattern mining to find a single infrequent event as an anomalous event. However,
this method cannot find a sequence of rare anomalous events because it does not
consider the order of events. The order of the events is important
because the events are recorded sequentially in the log file. In addition,
the order of events can change the process activities and hence the outcome.
Moreover, the method of Hadžiosmanović et al. [17] cannot predict possible
incoming anomalies in the SCADA system. The above-mentioned scenario has
led us to the following research problem: How can we analyse SCADA control
logs to design and develop an anomaly detection method based on rare sequential
patterns, and how can we predict possible anomalies in the SCADA control


system?

1.3 Research Aims and Scope

The primary aim of this research is to detect anomalies by analysing SCADA

control logs. It is assumed that anomalies are unexpected rare events or activities

compared to the regular activities of a system. As a common practice, activities

involving the operations of a system are recorded in a file, such as system logs.

This is primarily done to detect or trace system faults, which could be caused

by natural failure of the system or may be caused by an intruder conducting

cyber-attacks on the system. To achieve the primary aim, that is, the detection

of anomalies in a SCADA system, this research sets the following three objectives

to perform:

(i) To design and develop a method for finding anomalies that are rare in
SCADA control systems. The process events or activities of SCADA control
systems are mostly definitive and repetitive. This means that a set of events
is performed repeatedly to complete a process. Also, the events
are conducted sequentially. So, any rare event or sequence of
events would be a deviation from the normal or regular behaviour profile
of the SCADA control system. These rare events could be considered as
anomalies in the system. Therefore, there is a need to develop an algorithm
that can identify rare events from the SCADA control logs.

(ii) To improve the efficiency of the rare sequential pattern mining algorithm
without losing accuracy by introducing constraints. This objective aims to
improve the efficiency of the proposed rare sequential pattern mining algorithm
by removing unimportant rare patterns so that fewer rare sequential patterns
are generated. The reduced number of rare patterns in turn reduces the
computational time. The smaller number of rare patterns
can be achieved by integrating constraints into the rare sequential pattern
mining algorithm. This objective then checks the accuracy of the
constraint-based rare sequential pattern mining algorithm once
the efficiency is improved; that is, the research verifies whether the
algorithm still identifies anomalies from the reduced number of rare patterns.
The research also aims to verify whether the constraints add any overhead to the
complexity of the constraint-based rare sequential pattern mining algorithm.
Finally, the research checks the effect of the constraints on the false
positives produced by the rare sequential pattern mining algorithm.

(iii) To provide an anomaly prediction method that can extend the work of the

rare sequential pattern mining algorithm. This objective aims not only to

detect anomalies, but also to predict anomalies in the SCADA system.

The concept of association rules can be used to predict future events in a

SCADA system. The prediction can be done on the live or streaming logs.

If incoming events in the streaming logs can be found in the association

rules which are generated from rare sequential patterns, it can be predicted

from the incoming logs that the remaining events may occur in the future.

The benefit of this approach is that the system security operators could

be alerted about incoming anomalies or attacks in the system before they

occur.
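The prediction step described in objective (iii) can be sketched as follows. This is a minimal illustration under stated assumptions: rules are given as (antecedent, consequent) pairs of event tuples, an antecedent is matched only when it appears as the most recent contiguous events in the stream, and the class name `RulePredictor`, the example rule and the event names are all hypothetical rather than taken from this thesis.

```python
from collections import deque


class RulePredictor:
    """Raise a prediction when the tail of the stream equals a rule antecedent."""

    def __init__(self, rules):
        # rules: list of (antecedent, consequent) pairs, each a tuple of events.
        self.rules = list(rules)
        longest = max((len(a) for a, _ in self.rules), default=1)
        # Keep only as many recent events as the longest antecedent needs.
        self.window = deque(maxlen=longest)

    def observe(self, event):
        """Consume one streamed event and return the consequents predicted now."""
        self.window.append(event)
        recent = tuple(self.window)
        predictions = []
        for antecedent, consequent in self.rules:
            if recent[-len(antecedent):] == antecedent:
                predictions.append(consequent)
        return predictions


# Hypothetical rule derived from a rare sequential pattern.
rules = [(("threshold_write", "valve_open"), ("pressure_high", "compressor_fault"))]
predictor = RulePredictor(rules)

for event in ["compressor_on", "threshold_write", "valve_open", "pressure_high"]:
    for consequent in predictor.observe(event):
        print(f"after {event!r}: expect {consequent}")
```

Matching on contiguous recent events is a deliberate simplification; a matcher that allows gaps between antecedent events would need to track partial matches per rule.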

1.4 Research Contributions

The main contribution of this research is to find rare sequential patterns from
a sequential database. This is the first approach in the literature in which it is

shown that anomalies can be detected using rare sequential patterns. The follow-

ing research contributions are presented based on the research background and

motivation, aims, scope, and objectives of the research discussed in this chapter.

• Contribution 1: The first contribution of this thesis is the design and
development of a novel method of generating rare sequential patterns from a
sequential database. This method, which is presented in Chapter 3, determines
whether rare sequential patterns can be used to detect anomalies
by analysing SCADA control logs. It also analyses whether the
shortest (minimal) rare sequential patterns are effective in comparison
to the longest (maximal) rare sequential patterns for detecting anomalies.
This matters because the minimal rare patterns manifest
the starting point of an anomalous pattern, while the maximal rare patterns
give the complete scenario of the anomalies.

• Contribution 2: The second contribution of this research is to
improve the efficiency of our proposed rare sequential pattern mining
algorithm. The efficiency is improved by integrating constraints into the
proposed algorithm, as presented in Chapter 4. These constraints are the
time-span constraint, the feature reduction constraint, and the algorithmic
constraint. The time-span constraint is used so that only the significant
patterns are discovered. The feature reduction constraint is used to reduce
the number of unique events in the database so that fewer rare sequential
patterns are generated, which also reduces the computational time. Finally,
the algorithmic constraint is used to avoid unwanted database scanning,
which further reduces the computational time. Among these constraints,
the time-span constraint and the feature reduction constraint are used in
the data pre-processing stage, while the algorithmic constraint is used within
the rare sequential pattern mining algorithm. The purpose of adding these
constraints is to reduce the number of rare sequential patterns so that
anomalies can be identified from a smaller number of rare sequential patterns
with less computational time. Other constraints may also reduce the number of
rare sequential patterns and the computational time, but our experiments used
the above constraints. This method also ensures that the accuracy of the
algorithm is not degraded while the efficiency is improved. In other words, the
constraint-based rare sequential pattern mining algorithm does not sacrifice
accuracy compared to the unconstrained rare sequential pattern mining
algorithm. Finally, this constrained method reduces the number of false
positives in anomaly detection.

• Contribution 3: The third contribution of this research is the design
and development of a method to predict possible anomalies in SCADA streaming
logs. This anomaly prediction method is presented in Chapter 5. It
builds on the proposed rare sequential pattern mining algorithm
to generate sequential association rules. In our experiments, we used the
rules with the longest antecedent, although association rules with
variable-length antecedents can also be generated. Variable-length association
rules generate a large number of anomaly predictions because their shorter
antecedents are frequently found in the streaming logs. Moreover,
variable-length association rules include redundant rules, which also
contribute to the large number of anomaly predictions, because the shorter
antecedents are subsequences of the longest antecedents. As a result, many
predictions occur in the streaming logs for a single anomalous pattern. To
reduce the number of possible anomaly predictions and remove redundant rules, we
used the longest-antecedent rules. These association rules are then used to
predict possible incoming anomalies once the antecedent of a rule is found
in the streaming logs. This method also detects anomalies if the predicted
events subsequently occur in the streaming logs.
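The effect of keeping only the longest-antecedent rule can be illustrated with a small Python sketch. It assumes, for illustration only, that each rule splits a rare sequential pattern into a prefix (the antecedent) and the remaining suffix (the consequent), and that an antecedent fires whenever it occurs contiguously in the stream. The pattern, the stream and the function names are hypothetical, and rule confidence is not modelled.

```python
def split_rules(pattern):
    """All prefix -> suffix rules obtainable from one pattern (variable-length antecedents)."""
    return [(pattern[:i], pattern[i:]) for i in range(1, len(pattern))]


def longest_antecedent_rule(pattern):
    """The single rule whose antecedent is everything except the last event."""
    return (pattern[:-1], pattern[-1:])


def count_firings(antecedent, stream):
    """Number of positions where `antecedent` occurs contiguously in `stream`."""
    n = len(antecedent)
    return sum(tuple(stream[i:i + n]) == antecedent for i in range(len(stream) - n + 1))


# Hypothetical rare pattern and streaming log.
pattern = ("pump_on", "threshold_write", "valve_open", "pressure_high")
stream = [
    "pump_on", "level_high", "pump_off",
    "pump_on", "level_high", "pump_off",
    "pump_on", "threshold_write", "valve_open", "pressure_high",
]

all_predictions = sum(count_firings(a, stream) for a, _ in split_rules(pattern))
longest_predictions = count_firings(longest_antecedent_rule(pattern)[0], stream)
print(all_predictions, longest_predictions)
```

In this toy stream the one-event antecedent fires on every normal cycle, so the variable-length rules raise several predictions for a single anomalous pattern, while the longest-antecedent rule fires once.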

1.5 Structure of the Thesis

The general structure of this thesis is as follows. Chapter 2 presents

the background and literature review of this research. The contributions of this

research are presented in Chapters 3, 4, and 5. The research conclusion and

future research direction are described in Chapter 6.

The following section presents brief overviews of each of the above-mentioned
chapters of this thesis.

• Chapter 2 (Background and Literature Review): This chapter describes
the background of this research and relevant works pertaining to anomaly
detection in the literature. It starts with a generic view of
the SCADA control system along with the laboratory test-bed setup of an
industry-scale SCADA control system. The data mining and machine learning
approaches used for anomaly detection are also discussed. This
chapter also discusses the existing anomaly detection methods in SCADA
control systems. Further, it explains the reason for choosing
rare sequential pattern mining as an anomaly detection method. Finally,
this chapter concludes with the identification of the research gaps in the
literature.

• Chapter 3 (A Rare Sequential Pattern Mining Approach for Anomaly
Detection): This chapter presents the first contribution of this research.
In this chapter, a novel algorithm for rare sequential pattern mining is
proposed. This chapter shows that this method can be used to detect
anomalies from a sequential database. To detect anomalies, this method
is used to analyse SCADA control system logs. This chapter also


explains that, depending on the application domain, the size of the rare
patterns can play an important role in identifying anomalies in a system.
The minimal-sized rare pattern indicates the beginning of an
anomalous pattern, while the maximal-sized rare pattern represents the
entire scenario of an anomalous pattern.

• Chapter 4 (Constraint-based Rare Sequential Pattern Mining): This chapter
presents the second contribution of this research. It is concerned with
improving the performance of our proposed rare sequential
pattern mining algorithm presented in Chapter 3. The performance improvement
was achieved by introducing a constrained rare sequential pattern mining
algorithm, which improved the efficiency without degrading the accuracy
compared to the unconstrained rare sequential pattern mining algorithm
in Chapter 3.

• Chapter 5 (Sequential Association Rules Mining for Anomaly Prediction):
This chapter presents the third contribution of this research. It discusses
how sequential association rules generated from rare sequential patterns can
be used to predict possible anomalies in a SCADA control system. This
prediction can be made on a live system by analysing streaming SCADA control
logs.

• Chapter 6 (Conclusion and Future Work): This chapter concludes the
thesis by summarising the research contributions discussed in Chapters
3, 4, and 5. In addition, it outlines some of the research
problems that remain open for researchers to extend this research work.

Chapter 2

Background and Literature Review

This chapter presents an overview of Industrial Control Systems (ICSs), such
as Supervisory Control and Data Acquisition (SCADA) systems, and their
components, architecture and applications in Section 2.1. Section 2.2 discusses
the SCADA test-bed scenario. Section 2.3 discusses anomalies in SCADA control
systems. Section 2.4 discusses anomaly detection methods. Section 2.5
presents data mining and machine learning approaches for the detection of
anomalies. In Sections 2.6 through 2.8, we present pattern mining, constrained
pattern mining and association rule mining techniques, respectively. Section 2.9
reviews the state of the art in anomaly detection for SCADA control systems.
Finally, Section 2.10 concludes this chapter.

2.1 Background of SCADA Control System

Information technology is connecting or bringing physical devices into a com-

puter network system. As a consequence, computer based control systems have

grown to monitor and control machinery and industrial processes from remote

geographical locations. These computer based control systems can be classified
into different categories according to their application areas, such as
Supervisory Control and Data Acquisition (SCADA), Process Control Systems (PCS),
Distributed Control Systems (DCS), and Cyber-Physical Systems (CPS) [3]. All of
these control systems are called Industrial Control Systems (ICSs) because they
are used to monitor and control the process activities of different


industries.

There are various SCADA applications in ICSs, such as electricity, gas
and oil pipeline distribution, water utilities, and transportation networks.
The infrastructure of these networks can extend over large
geographical areas. Therefore, there is a need to monitor and control the process
activities of this infrastructure from a remote location. Among the different
control networks, SCADA is widely used in the electricity distribution sector [18].
The use of SCADA in power distribution systems started in the 1960s and has
been gradually evolving with the development of newer technologies [19].

Figure 2.1: A simplistic view of SCADA control system layout.

A general simplistic SCADA diagram is shown in Figure 2.1. It is composed
of three main sections comprising both hardware and software. In other words,
a SCADA system is composed of physical and logical components [20]. The
left part of Figure 2.1, comprising the supervisory systems and the Human
Machine Interface (HMI), is considered the control center. The middle section
is composed of the communication backbone and protocols that connect the
control center with the field devices on the far right, such as Remote Terminal
Units (RTUs), Programmable Logic Controllers (PLCs) and Intelligent Electronic
Devices (IEDs). The RTUs collect the data and convert them to digital signals
that are relayed to the control center (supervisory systems and HMI). The SCADA
control system is considered the hub or nerve center that controls the critical
infrastructures. The HMI device
is controlled by software that allows the control system operator to monitor the


process and events and to react if there is an emergency. The
supervisory system comprises computer servers, such as the data historian and
the Master Terminal Unit (MTU), that collect, process and log the data sent by
the field devices such as RTUs, PLCs and IEDs. Further, the supervisory unit
monitors and sends commands to control the processes on the field devices. The
communication infrastructure is composed of different links, such as radio
frequency, telephone line and fiber, over which communication protocols run to
pass information to and from SCADA devices.

2.2 SCADA Test-bed Scenario

To conduct the experiments in this research, we use our industry-scale SCADA
control system test-bed laboratory network, meaning that the test-bed network
is representative of an industrial-strength SCADA system. There are several
reasons why this research needs a SCADA test-bed control system rather
than a real-life control system. At the beginning of this research, we collected
some control logs from a SCADA-controlled electrical substation. After analysing
the logs, we found that the control system records the process activities in
a log file only when there is a malfunction in the system. In other words, all
the control system logs are error logs. An error log could result from a genuine
failure of the system or from a cyber-attack. However, nothing identifies which
logs stem from natural system failures and which stem from a cyber-attack,
if there is an attack on the system. Sometimes the system operators do not even
know whether their system has been compromised by a cyber-attack. Further,
the system never recorded the regular control process activities in the log file.
Since this research aims to find anomalies or abnormalities in SCADA control
systems, there is a need to have both normal activity logs and anomalous logs
to evaluate the experimental results. Therefore, the logs collected from the
real-life electrical substation control system are not suitable for the
experiments of this research. Also, it is not feasible to conduct attacks on a
real-life SCADA control system to generate datasets, because the output of the
process control system could be disrupted if attacks are carried out. Further,
even if the administrator of the control system allowed attacks on their system,
organization policy may prevent them from sharing the dataset, because they may
not want to disclose their system's weaknesses to the public. Therefore, there is a need


to have a SCADA control system test-bed where attacks can be conducted that

generate datasets suitable for research experiments and to validate the anomaly

detection experimental results.

The SCADA test-bed is designed with three individual physical control systems:
a conveyor belt, a pressure control and a water tank system. A physical
laboratory view of the test-bed is shown in Figure 2.2. The pressure control
system is placed on the right side of the figure, labelled with a circled (1).
The conveyor belt control system is placed in the middle of the figure,
labelled with a circled (2). Finally, the water tank control system is placed
on the left side of the figure, labelled with a circled (3).

Figure 2.2: A physical laboratory view of the SCADA test-bed.

The water tank control system consists of two tanks, the lower tank and the
upper tank. A water pump is used to fill the upper tank by transferring water
from the lower tank. The current water level in the upper tank is measured by a
sensor. The water in the upper tank rises from a lower threshold value to a
higher threshold value. Once the water level reaches the higher threshold value,
the water starts receding until it reaches the lower threshold value. Gravity
allows water in the upper tank to move back into the lower tank. This process
continues to repeat for a defined period of time.
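The fill and drain cycle of the water tank can be mimicked with a few lines of Python. The sketch below is a simplified simulation under assumed values: the thresholds, the fill and drain rates, the discrete time steps and the event names `level_high` and `level_low` are hypothetical and do not come from the test-bed.

```python
def simulate_tank(low, high, fill_rate, drain_rate, steps):
    """Return (time step, event) pairs for a tank cycling between two thresholds."""
    level = low
    filling = True  # the cycle starts with the pump filling the upper tank
    events = []
    for t in range(steps):
        if filling:
            level = min(high, level + fill_rate)
            if level >= high:
                events.append((t, "level_high"))
                filling = False  # water now recedes under gravity
        else:
            level = max(low, level - drain_rate)
            if level <= low:
                events.append((t, "level_low"))
                filling = True  # pump refills the upper tank
    return events


events = simulate_tank(low=10, high=20, fill_rate=5, drain_rate=5, steps=8)
for step, name in events:
    print(step, name)
```

A normal run produces a strictly alternating sequence of the two events, which is the kind of repetitive, ordered behaviour that makes deviations stand out in the logs.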

The conveyor belt is a bi-directional control system that separates light
objects from dark objects on a moving conveyor belt. Two sensors are used
to sort the objects on the conveyor belt. The first sensor is used to detect
the object on the conveyor belt and the second sensor is used to detect the
color of the object. The objects are collected in two different directions,
left and right, based on the colors of the objects, which are detected by a
sensor built into the control system.

Figure 2.3: A logical view of SCADA test-bed process control system.

Finally, the pressure control system pressurises an object up to a certain upper
threshold pressure value measured in pounds per square inch (PSI). The pressure
control is connected to an air compressor. The air pressure inside the pressure
control pipeline increases to a predefined upper threshold value. Once the
pressure level reaches the upper threshold value, a solenoid valve is opened to
release air and drop the pressure to a lower threshold value. When the air
pressure reaches the lower threshold value, the solenoid closes the valve and
the compressor starts to build up the pressure in the pipeline of the pressure
control system again. This process of
building and then releasing pressure continues for a defined time period.


The logical layout of the SCADA test-bed process control network is shown
in Figure 2.3. Each individual control system is attached to an industry-standard
ICS device, such as a Siemens S7-1200 PLC. These three control systems are
connected to a master PLC that aggregates the logs produced by the three control
systems. Each control system is monitored and controlled by a Human Machine
Interface (HMI) connected to the process control network of the test-bed. The
HMI also pools the logs from the PLCs connected to each control system. The
three control systems and the master PLC are connected to a switch of an Ethernet
network. The HMI and a Personal Computer (PC) are also connected to the
process control network. This PC, the attack PC, is used to conduct attacks that
disrupt the process activities of the control system and thereby generate the
anomalous logs.
These three control devices could be attacked to disrupt the normal process
activities by running a Python script on the attack PC. For example, the conveyor
belt sorting direction could be changed so that a white object on the belt is
sorted to the side where dark objects are collected, or a dark object is
sorted to the side where white objects are collected. Another example
is to change the pressure control system's upper and lower threshold values
from the defined set values. If the upper threshold value is changed to a very
high value, the pipe could burst or the compressor could explode.

2.3 Anomalies in SCADA Control System

The SCADA system is usually vulnerable to physical as well as cyber-attacks,
and attacks on this system have increased manyfold since the start of the 21st
century [18]. The main security concern of the ICS network is that the protocols
and the devices cannot withstand cyber-attacks. This is because the security
goals, that is, CIA (Confidentiality, Integrity and Availability), are not fully
addressed. ICS networks are changed less frequently than conventional IT networks.
Once an ICS network is configured, the set-up remains almost unchanged for
many years. This is because the functionalities performed by the ICS networks
are definitive and do not require frequent changes to the system. As a result,
the control system software is not updated, which leaves the system vulnerable.
Furthermore, modern ICS networks are connected to the internet, which exposes
the control system to cyber-attacks. Firewalls alone cannot protect the ICS
control system. So, there is a need for an intrusion detection system (IDS) which
can detect cyber-attacks that have occurred on the system.

Although the importance of CIA is the same for both traditional or standard
IT networks and ICS networks, their security concerns or implementation
priorities are different. For standard IT networks, the priorities are protecting
the data (providing confidentiality), ensuring correct commands (maintaining
integrity) and minimising interruptions (availability of resources). In other
words, CIA is the order of importance. In contrast, for ICS networks the
priorities are ensuring correct commands (maintaining integrity), reducing
interruptions (availability of resources) and protecting the data (providing
confidentiality), that is, integrity, availability and confidentiality (IAC).
Moreover, the security issues are traditionally defined or formulated by the ICS
organizations to support their individual goals. Therefore, security designed
and applied to one infrastructure cannot be fully implemented for other
infrastructures. Security techniques or policies that
protect standard IT networks cannot be adopted for ICS networks [21] because
of the differences between these two kinds of networks (standard IT and ICS).
These differences include system characteristics, system maintenance and
upgrading, security practices, security countermeasures and the impact of
cyber-attacks on the systems. Among these, the consequences of a successful
cyber-attack are more severe for ICS networks than for standard IT networks
considering the cost of the damage involved. Therefore, we argue that ICS
networks or control system networks (SCADA) need critical-infrastructure-specific
security strategies to safeguard the system. Because ICSs are legacy systems,
introducing new protocols is difficult and expensive. Instead, anomaly or
intrusion detection and firewalls can be added.

2.4 Anomalies Detection Methods

Malicious events or intrusions can be detected using anomaly detection techniques
[22]. Anomaly-based detection was originally proposed by Denning [23] and since
then this method has been used in computer security for intrusion detection [24].
In general, algorithms used for anomaly or intrusion detection need
normal operation data, called labeled data, to build a training model.
These algorithms generally consider anomalies as patterns that have not been
seen before in the normal behavioral patterns of a system [23] [24]. The
signature-based and anomaly-based intrusion detection techniques were originally


used in traditional IT (Information Technology) networks. Later, these
detection techniques were gradually adapted to the SCADA control system
network [25]. The operational activities of SCADA are not completely similar to
those of traditional IT, and the SCADA control system is used to monitor and
control the process activities of another infrastructure, that is, the
Industrial Control Systems (ICSs). Manganaris et al. [26] showed that frequent
behaviour over a long period of time could be considered as normal behaviour of
a system. Therefore, the absence of a frequent event or set of events can be
considered as an anomaly. For example, a specific alarm that occurs every minute
is normal, whereas a sudden burst of alarms that has never happened before is
more suspicious than the frequent alarms. Clifton et al. [27] applied a
sequential association mining technique to identify normal behaviour of a system
based on the frequent occurrence of a sequence of alarm events, which was later
filtered out from the list of suspicious events. Julisch and Dacier [28] used
episode rule mining and clustering techniques to reduce irrelevant alarm signals
using false positives from historical alarms. The authors first discover the
patterns of false positives and later remove the false positive patterns from
the possible anomaly alarms, which helps to reduce the large number of alarms.
These methods depend on signature-based rules, which inherently lack the ability
to identify new attacks. Data mining based anomaly detection can be used for
both signature-based and anomaly-based detection techniques [29] [24].

2.4.1 Signature-based Detection

The signature-based method uses prior knowledge of attack signatures that
define the patterns of an attack. Fan et al. [30] used a signature-based ANN
classifier to detect malicious sequential patterns from a sequence of machine
instructions. Hadžiosmanović et al. [17] applied a semi-automated approach to
analyse water treatment SCADA logs. The authors used a frequent itemset mining
approach to find a rare event by varying the support value in SCADA
process logs. Using this approach, the authors could only identify a single rare
event. Later, they used the stakeholders' knowledge to identify whether the rare
event is an anomalous pattern or not. Although this method could identify a rare
itemset pattern, it cannot identify a sequence of events as a rare anomalous
pattern. Their method also inherently lacks the ability to identify new malware
for which no signature has previously been trained. This is because
human expertise is required to design, test and deploy the signatures. Therefore,
it is a time-consuming effort, and updates for new attack signatures cannot be
promptly generated and deployed into a system. This manual human
effort cannot cope with the rapidly changing behavior of attack patterns and
hence requires automatic signature generation [31] [32]. For commercial purposes,
signature-based techniques are used for intrusion detection due to their high
detection rate, reliability and low false alarm rate [33]. However, this
technique cannot identify unknown or zero-day attacks because of the constant
change in the attack patterns. Moreover, the signatures require well-defined
rules for the possible attacks, which are almost impossible to maintain because
of the constant changes in the system vulnerabilities and attack patterns.
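The limitation described above can be seen in a minimal Python sketch of signature matching. It is a toy illustration, not any of the cited systems: a signature is assumed to be a fixed, contiguous sequence of log events, and the signature name, the event names and both logs are hypothetical.

```python
def occurs_in(signature, log):
    """True if `signature` appears as a contiguous run of events in `log`."""
    n = len(signature)
    return any(tuple(log[i:i + n]) == tuple(signature) for i in range(len(log) - n + 1))


def match_signatures(log, signatures):
    """Names of all known signatures found in the log."""
    return [name for name, sig in signatures.items() if occurs_in(sig, log)]


# Hypothetical signature database built from previously analysed attacks.
signatures = {
    "threshold_tamper": ("login", "threshold_write", "valve_open"),
}

known_attack = ["pump_on", "login", "threshold_write", "valve_open"]
novel_attack = ["pump_on", "login", "direction_write", "belt_reverse"]

print(match_signatures(known_attack, signatures))
print(match_signatures(novel_attack, signatures))
```

The second log contains an attack for which no signature exists, so the matcher reports nothing until a human expert adds a new entry to the database.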

2.4.2 Anomaly-based Detection

The anomaly-based method of finding anomalies uses the normal operational
behaviour of a system or network. Wespi et al. [34] used a behaviour-based
anomaly detection method to build a model of normal behaviour from audit data. To
detect anomalies, they compared the observed behaviour with the stored normal
behaviour. The behaviour-based method detects any deviation or unexpected
behaviour of the system without human intervention. The normal behaviour profile
can be created for each individual system. This technique applies machine
learning, data mining or statistical methods, which are based on supervised and
unsupervised learning. The supervised method requires prior knowledge or
understanding of the system, whereas the unsupervised method does not require any
previous knowledge of the system [31] [32]. The unsupervised method automatically
builds the system profile, and any deviation from the normal behaviour profile is
detected as an anomaly. As a result, it can detect known as well as unknown
attacks. Moreover, as no manual update of signatures is required for the
unsupervised method, anomaly detection is faster than with a signature-based
system. However, this method generates a high false alarm rate because it treats
any previously unseen events, even newly added valid events, as anomalies and
therefore cannot be fully reliable.


2.5 Data Mining and Machine Learning

2.5.1 Data Mining

Data mining is an analytical step in the knowledge discovery process [35]. Data
mining automatically discovers hidden, interesting, useful, and understandable
knowledge from a large collection of data [36]. The algorithms used in data
mining derive knowledge from various fields, such as statistics, databases,
pattern recognition and machine learning [36]. The knowledge derived from one
application domain can be useful in, or applied to, another application domain.
Cios et al. [35] developed a knowledge discovery model that accommodates both
academic research and industrial application aspects. It comprises multiple steps
that are executed in sequence, where every step requires the previous step's
result as input. The knowledge discovery model is an iterative process with some
feedback loops. To prepare a dataset that can be used as input to the data mining
algorithms, the initial raw data needs to be preprocessed. Data preprocessing
involves several steps, such as the selection, cleansing, construction,
integration and formatting of data [37]. This step takes a comparatively large
amount of time compared with the other steps of the knowledge discovery process
[35].

After dataset preparation, that is, once the dataset is ready for processing, the
actual data mining methods are used to extract the information required by the
user. In this phase different data mining algorithms are applied to the processed
dataset to extract interesting and understandable knowledge. Our research, Data
Mining Critical Infrastructure Control Logs for Anomaly Detection, uses
experimental data mining methods to detect anomalies in SCADA control logs. The
data mining methods used are composed of several steps, as shown in Figure 2.4.
The steps are data source selection, data preprocessing, relevant data selection,
development or selection of a data mining algorithm, pattern discovery, and
analysis of the patterns to extract knowledge for a practical application.

Figure 2.4: A data mining approach for information extraction.

The main working domain of our research is pattern mining, which is one of the
research branches of data mining. In the literature there are two major
categories of pattern mining research: frequent pattern mining and infrequent or
rare pattern mining. Our research uses the rare pattern mining approach to
discover unusual or unexpected behaviour of a system. There has been little work
on the use of rare pattern mining for SCADA control systems. Hadžiosmanović et
al. [17] used a rare itemset pattern mining approach to detect anomalies by
analysing logs from a SCADA water treatment plant. However, they did not preserve
the order of events in the control logs. This thesis maintains the order of
events to detect anomalies by using a novel rare sequential pattern mining
approach.

As the process activities of a SCADA control system are definitive and
repetitive, the events become frequent and represent the regular or normal
behaviour of the SCADA control system. In this thesis, it is hypothesized that
irregular or abnormal behaviour occurs rarely in a system. Any change to the
SCADA process control system produces rare behaviour, which can be identified by
using rare sequential patterns. Rare anomalies can reflect the unexpected,
irregular behaviour of a system. Our hypothesis is tested in Chapter 3 of this
thesis with a rare pattern mining experiment using SCADA control logs.

After the desired rare patterns have been discovered, the pattern analysis step
analyses them to find anomalies in the system, which requires domain expertise.
In this final stage of the knowledge discovery process, the discovered knowledge
is documented and applied inside the targeted system. It should also be noted
that the knowledge discovered for one domain can be extended and applied to other
knowledge areas.

2.5.2 Machine Learning Methods

Machine learning is the study of algorithms that learn from experience, as human
beings do. In the learning process, machine learning algorithms map a set of
inputs to outputs with the help of computer programs. These algorithms are used
in data mining tasks to build models automatically that extract patterns or
knowledge from large volumes of machine-generated data. Traditional learning
methods can be categorized into two groups: supervised learning and unsupervised
learning [38]. There is also a further learning method called semi-supervised
learning, which is a blend of supervised and unsupervised learning.

2.5.3 Supervised Learning Method

Supervised learning usually works in two phases: the first is the training phase
and the second is the testing phase. A model is built, or trained, with a
training data set and then its performance or accuracy is checked with a test
data set [24] [22]. The output of this model can be a classification or a
regression. Such a model can be built only when labelled training data is
available. Example algorithms that require labelled data are decision tree and
Support Vector Machine (SVM) classifiers. An example of a decision tree
classifier is given in Figure 2.5. The decision tree is composed of three
features: Weather Forecast, Humidity Condition and Wind Condition. Each of these
features has a defined set of values. For example, Weather Forecast has three
values: sunny, cloudy and raining. This means that a particular day could be
sunny, overcast or rainy. Similarly, the Humidity Condition of a day could be
normal or high, and the Wind Condition could be strong or weak. The decision to
play a tennis match on a given day depends on the combination of values of these
features, as does the decision not to play. For example, a tennis match could be
played (a) if it is a sunny day and the humidity condition is normal, or (b) if
it is a cloudy day, or (c) if it rains and the wind condition is weak.

Figure 2.5: Supervised learning method.

Although these supervised methods are effective in detecting known anomalies,
they cannot identify unknown anomalies. In addition, these methods incur the cost
of training the model, which includes preparing the labelled training dataset.
Finding anomalies in a SCADA system using supervised methods cannot be effective
because these methods cannot find unknown or zero-day attacks. Since SCADA
systems are used to control industrial processes, failure to detect a zero-day
attack could have a devastating impact on the economy and the environment.
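
The tennis decision rules (a) to (c) described above can be written down as a
small hand-coded decision tree. The following is only an illustrative sketch: the
function name play_tennis and the lower-case string values are assumptions made
for this example, and the tree is written by hand rather than learned from
labelled data as a real supervised classifier would be.

```python
def play_tennis(weather: str, humidity: str, wind: str) -> bool:
    """Return True when rules (a)-(c) from the text say a match is played.

    The feature values ("sunny", "cloudy", "raining", "normal", "high",
    "strong", "weak") are hypothetical encodings of the values in the text.
    """
    if weather == "sunny":
        return humidity == "normal"   # rule (a): sunny and normal humidity
    if weather == "cloudy":
        return True                   # rule (b): any cloudy day
    if weather == "raining":
        return wind == "weak"         # rule (c): rain with weak wind
    raise ValueError(f"unknown weather value: {weather!r}")


print(play_tennis("sunny", "normal", "strong"))   # True
print(play_tennis("raining", "normal", "strong")) # False
```

Note that each branch only inspects the features relevant to its rule, which is
what makes a decision tree easy to read but also limits it to the cases seen when
the rules were built.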

2.5.4 Unsupervised Learning Method

The unsupervised learning method does not require any labelled data to detect
anomalies. In other words, the data is not classified as attack data or
non-attack data. Therefore, an unsupervised model needs no separate training and
testing datasets to detect anomalies in a dataset. The underlying unknown
structure of the unlabelled data can be revealed by methods such as clustering,
dimension reduction and association rules, which group the data based on
similarities. Since the data does not need to be compared with a labelled
dataset, the learning process of the unsupervised model is faster than that of
the supervised learning model. However, the anomaly detection accuracy of the
unsupervised model is lower than that of the supervised and semi-supervised
models [38]. This is because the unsupervised method suspects any abnormal
behaviour as a potential anomaly on the system. In the clustering method shown in
Figure 2.6, unlabelled data items are distinguished from each other by separating
them into different groups. These groups are formed based on the close similarity
of features among the data. In Figure 2.6, three groups can be generated from the
unlabelled data. Different kinds of unsupervised learning algorithms have been
developed, and K-means is one of the most widely used clustering algorithms. This
clustering method is not appropriate for finding anomalies in a SCADA control
system. This is because in a SCADA system the events occur in a sequential
manner, where a sequence of events completes a task, and these events are
correlated with each other. Therefore, these events cannot be separated into
different clusters to find an anomalous sequence of events. Instead, a sequential
pattern mining method can be used to find anomalous sequences of events in a
SCADA control system.

Figure 2.6: Unsupervised learning method.

2.5.5 Semi-supervised Learning Method

Semi-supervised learning lies between the supervised and unsupervised methods.
There are circumstances in which the semi-supervised method of learning is
needed. Sometimes it is hard to find labelled data, because labelling data may
require expertise that is difficult to obtain and hence expensive. Labelling may
also be time consuming and may require special devices. Furthermore, the amount
of input data may be large while only a small part of it is labelled, leaving a
large amount of the data unlabelled. The semi-supervised algorithm tries to find
strong inductive biases from a large amount of unlabelled data [39]. A general
approach is shown in Figure 2.7, where γ is a hidden structure associated with
both object A and object B. The object B′ contains a few labelled examples of
object B. Some data in object A is labelled and some is unlabelled. The
unlabelled data in A assists in inferring the object B using the hidden structure
γ. Another example of semi-supervised learning is infant word-object mapping
[40], which measures the ability to associate a word with an object. If an infant
hears a word many times before the word's corresponding object (labelled data) is
seen, the association is stronger. However, if the word has not been heard
before, the association is weak.

Figure 2.7: Semi-supervised learning method.

2.6 Pattern Mining

Pattern mining is an important and widely studied task in data mining. It allows
the extraction of interesting hidden information as well as relations among data,
such as association rules [41] [42], correlations [43], causality [44],
sequential patterns [45], multidimensional patterns [46], episodes and emerging
patterns [47] [48] and many other patterns in large databases. Han et al. [49],
in their data mining book, define a pattern as: "a set of items, subsequences, or
substructures that occur frequently together (or strongly correlated) in a
dataset." Patterns can be discovered from a large dataset. There are two major
types of pattern mining: itemset pattern mining and sequence pattern mining. Each
of these techniques can be further categorized as frequent or rare pattern
mining.

Data mining techniques extract hidden patterns from large volumes of data. A
pattern is sequential when the data is represented in a sequence or time-related
format. A sequence comprises a set of transactions, and each transaction is
composed of a set of events or items. The sequence of transactions and events is
shown in Figure 2.8.

Figure 2.8: A sequence diagram.

Pattern analysis based on a sequence database is regarded as sequential pattern
mining. This technique was first introduced by Agrawal and Srikant [50] in 1995
and defined as follows:

"Given a database of sequences, where each sequence consists of a list of
transactions ordered by transaction time and each transaction is a set of items,
sequential pattern mining is to discover all sequential patterns with a user
specified minimum support, where the support of a pattern is the number of data
sequences that contain the pattern."

There has been much research on algorithm development for sequential pattern
mining. These algorithms can be categorized into two broad groups: (i)
Apriori-based algorithms and (ii) pattern growth-based algorithms [51] [52].
Apriori-based algorithms apply breadth-first or level-wise search techniques.
This technique generates many candidate sequences, whose number can be
exponential in the worst case. Most of the candidate sequences are useless and
undesirable and therefore need to be pruned. Apriori-based algorithms also scan
the sequence database multiple times, which increases processing time [51] [52].
Examples of Apriori-based algorithms are Generalized Sequential Patterns (GSP),
which is based on a horizontal database format, and Sequential Pattern Discovery
using Equivalence classes (SPADE), which is based on a vertical database format.

To overcome the drawbacks of Apriori-based algorithms, especially to remove
candidate generation, the Frequent Pattern Growth (FP-Growth) algorithm was
introduced. This algorithm applies the divide and conquer method [53] [54]. It
converts the sequence database into a frequent pattern tree and is faster on
large data than the Apriori algorithm [51]. PREFIX-projected Sequential PAtterN
mining (PrefixSpan) is a widely used example of the pattern growth approach.
Sequential pattern mining has a wide range of applications, for example, finding
customer buying patterns in order to offer customers new products, redesigning a
company's web site after analysing customers' browsing patterns, and finding DNA
sequences. As a novel approach, this thesis proposes and develops a rare
sequential pattern mining algorithm using an Apriori-based method.

2.6.1 Itemset Pattern Mining

An itemset pattern mining algorithm extracts interesting and useful patterns from
a transaction database. This concept was first introduced by Agrawal and Srikant
[55], who discovered groups of items in a customer transaction database that were
frequently purchased together. Let $I$ be the set of all items. A transaction
database $D = \{T_1, T_2, \ldots, T_n\}$ is a set of transactions such that each
transaction $T_q \subseteq I$ ($1 \le q \le n$) is a set of distinct items [56].
Each transaction $T_q$ can be identified with a unique identifier called a
Transaction ID (TID). An example of a customer transaction database is given in
Table 2.1.

Table 2.1: Transaction database TDB.

Transaction ID    Transaction
TID1              {a, b, d, e}
TID2              {c, d, f}
TID3              {b, c, d, f}
TID4              {c, f}
TID5              {a, b, c, e, f}

The example transaction database has five transactions named TID1 to TID5. Each
transaction comprises a set of items, or an itemset. An itemset $X$ is a set of
items such that $X \subseteq I$. For example, {a, b, d, e} is the itemset of
transaction TID1 in Table 2.1. This itemset is composed of four items that were
purchased by a customer in one transaction. A transaction does not record the
order in which its items were purchased. In other words, the purchase order
cannot be recovered from an itemset. Therefore, there is no difference between
the itemsets {a, b, d, e} and {b, a, e, d}, because they hold the same items.
Further, if an item is purchased in multiple quantities, it is listed as one
item. For example, if item c in transaction TID4 were purchased 4 times, it would
still be counted as one item.

The task of itemset pattern mining is to find all the itemsets that appear
together frequently in a transaction database. The initial purpose of itemset
pattern mining was market basket analysis, which promotes sales by arranging
items that are bought together next to each other. For example, the itemset
{c, f} is the most frequently purchased combination of items in Table 2.1,
appearing in four of the five transactions. As a result, these items can be
placed next to each other on the shelf to increase their sales. Since the market
basket application, itemset pattern mining has been extended to many different
domains, such as product recommendation, text mining, bioinformatics, e-learning,
web page analysis, network traffic analysis and image classification [57] [58]
[59]. Many algorithms have been developed for itemset pattern mining, such as
Apriori [55], FP-Growth [60], Eclat [61] and FIN [62].
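
The support counting behind the {c, f} example can be sketched in a few lines of
Python over the data of Table 2.1. This is a brute-force illustration, not one of
the cited algorithms; the names TDB, support and pair_supports are assumptions
introduced for this example.

```python
from itertools import combinations

# Transaction database from Table 2.1; each transaction is an unordered set.
TDB = {
    "TID1": {"a", "b", "d", "e"},
    "TID2": {"c", "d", "f"},
    "TID3": {"b", "c", "d", "f"},
    "TID4": {"c", "f"},
    "TID5": {"a", "b", "c", "e", "f"},
}


def support(itemset, database):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for transaction in database.values() if items <= transaction)


def pair_supports(database):
    """Support of every 2-itemset that occurs in at least one transaction."""
    all_items = sorted(set().union(*database.values()))
    result = {}
    for pair in combinations(all_items, 2):
        count = support(pair, database)
        if count > 0:
            result[pair] = count
    return result


best_pair = max(pair_supports(TDB).items(), key=lambda kv: kv[1])
print(best_pair)  # (('c', 'f'), 4)
```

Because transactions are stored as sets, {a, b, d, e} and {b, a, e, d} are the
same itemset and receive the same support, which mirrors the point made above
that an itemset carries no ordering information.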

2.6.2 Sequential Pattern Mining

In itemset pattern mining, the order of events is not considered, although in
some applications the order of items or events is important. Therefore, itemset
pattern mining cannot be applied to extract the desired information from a
database in which the order of events is important. For example, when finding
intrusions in a SCADA network the order of the events is important [63], because
a change in the order of events can cause the SCADA process control system to
malfunction. For example, in a water tank control system a pre-defined sequence
of events fills a water tank to its maximum threshold capacity. The water is then
drained from the tank and, when it reaches its lower threshold level, the water
pump turns on and again fills the tank to its maximum capacity. The ordered
events are as follows:

〈{Close_Valve}, {Pump_On}, {Check_Water_Max_Threshold_Value}, {Pump_Off}, {Open_Valve}〉

These ordered events continue as the regular process of the water tank control
system. However, if the order of the events deviates from this regular process,
the water tank floods. The altered ordered events are given below:


〈{Open_Valve}, {Check_Water_Max_Threshold_Value}, {Pump_Off}, {Close_Valve}, {Pump_On}〉
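
The two water tank event sequences above contain exactly the same events and
differ only in order. A minimal sketch, using plain Python lists and sets and
hypothetical variable names, shows why an itemset view cannot distinguish them
while a sequence view can:

```python
# Regular and altered event orders of the water tank example.
normal = ["Close_Valve", "Pump_On", "Check_Water_Max_Threshold_Value",
          "Pump_Off", "Open_Valve"]
altered = ["Open_Valve", "Check_Water_Max_Threshold_Value", "Pump_Off",
           "Close_Valve", "Pump_On"]

same_items = set(normal) == set(altered)  # itemset view: order is discarded
same_order = normal == altered            # sequence view: order is kept

print(same_items, same_order)  # True False
```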

In text mining, the order of the words in a sentence is also important [64]. Two
types of data can be used in sequential pattern mining: time-series data and
sequence data. Time-series data has numeric values, whereas sequence data has
nominal (symbolic) values [65]. Examples of time-series data are stock prices in
the capital market and the electricity consumption of a user. Examples of
sequence data are the sequence of web clicks made by a user while visiting a
website and the sequence of items purchased by a customer over a period of time.

Sequential pattern mining is an active research area and the technique has been
used in different application areas, such as market basket analysis,
bioinformatics, e-learning, text mining and web click-stream analysis [65]. Let
$I = \{i_1, i_2, \ldots, i_m\}$ be a set of items or symbols. An itemset $X$ is a
set of items such that $X \subseteq I$. A sequence is an ordered list of itemsets
$s = \langle I_1, I_2, \ldots, I_n \rangle$, where $I_k \subseteq I$
($1 \le k \le n$). For example, 〈{a}, {b, c}, {d}, {e, f}〉 is a sequence of
items purchased by a customer over a period of time. The customer purchased item
{a} in the first transaction. After some time, in the second transaction, the
customer bought two items {b, c} at the same time, then purchased item {d}, and
finally purchased items {e, f} in the last transaction. This sequence has 4
transactions, indicating that the customer purchased items in a sequential order
at different times. Therefore, the size of the sequence is 4. However, the length
of the sequence is 6, which is the number of items the customer purchased over
the time period.
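
The distinction between the size and the length of a sequence can be checked with
a short sketch. Representing a sequence as a Python list of sets, and the
function names sequence_size and sequence_length, are assumptions made only for
this illustration.

```python
def sequence_size(sequence):
    """Size: the number of itemsets (transactions) in the sequence."""
    return len(sequence)


def sequence_length(sequence):
    """Length: the total number of items over all itemsets."""
    return sum(len(itemset) for itemset in sequence)


purchases = [{"a"}, {"b", "c"}, {"d"}, {"e", "f"}]
print(sequence_size(purchases), sequence_length(purchases))  # 4 6
```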

A sequence database SDB is composed of a list of sequences, that is,
$SDB = \langle s_1, s_2, \ldots, s_p \rangle$. An example of a sequence database
is given in Table 2.2.

Table 2.2: A sequential database SDB

Sequence ID    Sequence
SID1           〈{a}, {b, c}, {d}〉
SID2           〈{a}, {e}, {f}, {b}〉
SID3           〈{a, b}, {d}, {c}〉
SID4           〈{b}, {c, e}, {e}〉
SID5           〈{c}, {b}, {e}, {d}〉

There are 5 sequences in the sequence database SDB in Table 2.2. Each sequence is
identified with a Sequence ID (SID) number. The purpose of sequential pattern
mining is to find subsequences that are interesting to the user. A sequence
$\langle a_1, a_2, \ldots, a_n \rangle$ is called a subsequence of another
sequence $\langle b_1, b_2, \ldots, b_m \rangle$, where $m \ge n$, if there exist
integers $1 \le i_1 < i_2 < \ldots < i_n \le m$ such that
$a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$.
The interest of a subsequence can be measured by its frequency. A subsequence
that is rare in the database SDB can also be interesting to some users, depending
on the application domain.
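
The subsequence definition above translates directly into a containment test, and
counting the sequences that pass the test gives the support of a pattern. The
sketch below runs on the data of Table 2.2; the names SDB, is_subsequence and
support are assumptions for this example, and the greedy left-to-right matching
is one simple way to implement the definition rather than the method of any cited
algorithm.

```python
# Sequence database from Table 2.2; each sequence is a list of itemsets.
SDB = {
    "SID1": [{"a"}, {"b", "c"}, {"d"}],
    "SID2": [{"a"}, {"e"}, {"f"}, {"b"}],
    "SID3": [{"a", "b"}, {"d"}, {"c"}],
    "SID4": [{"b"}, {"c", "e"}, {"e"}],
    "SID5": [{"c"}, {"b"}, {"e"}, {"d"}],
}


def is_subsequence(pattern, sequence):
    """True if every itemset of `pattern` is a subset of a distinct itemset of
    `sequence`, with the matched positions strictly increasing."""
    pos = 0
    for itemset in pattern:
        # Advance to the earliest remaining itemset that contains `itemset`.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern itemset must match a later position
    return True


def support(pattern, database):
    """Number of sequences in `database` that contain `pattern`."""
    return sum(1 for seq in database.values() if is_subsequence(pattern, seq))


print(support([{"b"}, {"d"}], SDB))         # 3 (SID1, SID3, SID5)
print(support([{"a"}, {"d"}, {"c"}], SDB))  # 1 (SID3 only)
```

Matching each pattern itemset at the earliest possible position is sufficient
here, because taking an earlier match never rules out a later one.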

2.6.2.1 Frequent Sequential Pattern

In frequent sequential pattern mining, a sequence is considered frequent if its
frequency, or support value, is greater than or equal to a user-provided
threshold called the minimum support value minsup [66]. For example, the sequence
〈{b}, {d}〉 is a frequent subsequence in the sequence database SDB in Table 2.2
when the minimum support value is set to minsup = 2. The support of the sequence
〈{b}, {d}〉 is 3 because this sequence is found in 3 sequences: SID1, SID3 and
SID5. Similarly, the sequence 〈{a}〉 is also a frequent pattern since it appears
in 3 sequences in the database SDB (SID1, SID2 and SID3), that is, its support
value is greater than minsup = 2. In some application domains, frequent patterns
are considered interesting because they represent the regular or expected
behaviour of a system. Many algorithms have been developed to find frequent
patterns in sequence databases. Some of the most popular are Srikant and
Agrawal's [66] GSP algorithm, Zaki's [67] SPADE algorithm and Pei et al.'s [45]
PrefixSpan algorithm. All of these algorithms find all the frequent subsequences
that are considered interesting to the users. These algorithms can be categorised
into two groups: Apriori or level-wise algorithms and pattern growth algorithms.
GSP is an example of an Apriori algorithm and PrefixSpan is an example of a
pattern growth algorithm. The pattern growth algorithm is computationally faster
than the Apriori algorithm because, unlike the Apriori algorithm, it does not
generate candidate sequences.


2.6.2.2 Rare Sequential Pattern

In rare sequential pattern mining, a sequence is considered rare if its support
value is lower than the user-defined maximum support value maxsup. For example,
the sequence 〈{a}, {d}, {c}〉 is a rare sequence because it appears only once in
the sequence database SDB in Table 2.2, namely in the sequence SID3. This
sequence is rare because its support value of 1 is below the maxsup value set to
2. In some application domains, rare cases are interesting to the users. For
example, rare symptoms of a disease are important for identifying the disease.
Another example of a rare case that could be suspicious and interesting arises in
a fire alarm system. In normal circumstances, whenever smoke is detected, the
fire alarm is triggered. In other words, the smoke detected event is followed by
the fire alarm triggered event, which is the correct behaviour of the system.
However, if smoke has been detected but the fire alarm is not triggered, this
sequence of events is an irregular, rare behaviour of the system. Although many
algorithms have been developed in the literature to find frequent sequential
patterns, there exist no algorithms to find rare sequential patterns from a
sequential database.

2.7 Constraint-based Pattern Mining

Generally, a constraint can be considered a preference or restrictive parameter
that is incorporated to discover or extract the expected useful information from
databases. That is, constraints are parameters that limit which data is accessed
and processed in order to find the information of interest to the user.
Constraints can be used at the data source, when choosing the target databases,
as well as at the algorithmic level during the data processing stage. For
example, to extract a meaningful or significant pattern from the data of a
particular time period or episode, we need to select a data source that has
episodic characteristics. This means that there is some fixed or irregular time
gap between the episodic events or activities in the system. In an episodic time
period the system accomplishes a complete task and only then is data recorded in
the logs. When there are no activities in the system, no data is recorded in the
logs. Therefore, during data source selection, we need to choose episodic data,
which has time gaps, instead of continuous data, which has none. An example of an
algorithmic constraint is not allowing a pattern to exceed a predefined size: if
a pattern exceeds the size constraint, it can no longer be a useful pattern.

Constraints can reduce the search space in the database and can filter the
results while extracting the required information [68]. Constraints are most
often used in pattern mining tasks. Pattern mining, particularly sequential
pattern mining, generates a large number of results (patterns), and it is hard to
analyse and discover the desired pattern among them [69] [70]. Most of the
patterns are useless or unimportant because they cannot be used to identify the
information required by the user [71]. However, a large amount of computational
time is required to process these unimportant patterns during the pattern mining
process. To reduce this time and filter out the unnecessary patterns, domain
related knowledge is needed to extract the desired result.

Pei et al. [72] define a constraint C as follows: a constraint C is a predicate
on the powerset of the set of items I, that is,
$C : 2^I \rightarrow \{\text{true}, \text{false}\}$, where
$I = \{i_1, i_2, \ldots, i_m\}$ is a set of items. A sequence S satisfies a
constraint C if and only if C(S) is true. For example, to find a constrained rare
sequential pattern, the following condition needs to be satisfied:
$sup(S) \le maxsup \wedge C(S) = \text{true}$. Assume the user-defined maximum
support maxsup is set to 2 and the constraint C is on the size of a rare pattern,
that is, the number of events in a rare pattern should be no greater than 3. Then
the rare sequential pattern 〈{a}, {e}, {b}〉 from Table 2.2 can be called a
constrained pattern, because it has a support of 1 (it occurs only in SID2) and
contains 3 events, and so satisfies the above condition.
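
The constrained rarity condition sup(S) ≤ maxsup ∧ C(S) = true can be sketched as
a predicate check over the data of Table 2.2. The helper names below
(is_subsequence, support, size_at_most_3, is_constrained_rare) are assumptions
for this illustration, and the constraint is passed in as an ordinary Python
function standing in for the predicate C.

```python
SDB = [
    [{"a"}, {"b", "c"}, {"d"}],        # SID1
    [{"a"}, {"e"}, {"f"}, {"b"}],      # SID2
    [{"a", "b"}, {"d"}, {"c"}],        # SID3
    [{"b"}, {"c", "e"}, {"e"}],        # SID4
    [{"c"}, {"b"}, {"e"}, {"d"}],      # SID5
]


def is_subsequence(pattern, sequence):
    """Greedy in-order containment test (see the subsequence definition)."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, database):
    return sum(1 for seq in database if is_subsequence(pattern, seq))


def size_at_most_3(pattern):
    """Example constraint C: a pattern may contain at most 3 events."""
    return len(pattern) <= 3


def is_constrained_rare(pattern, database, maxsup, constraint):
    """sup(S) <= maxsup and C(S) is true, as in the condition above."""
    return support(pattern, database) <= maxsup and constraint(pattern)


print(is_constrained_rare([{"a"}, {"e"}, {"b"}], SDB, 2, size_at_most_3))  # True
print(is_constrained_rare([{"b"}, {"d"}], SDB, 2, size_at_most_3))         # False
```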

Many constraints have been proposed in the literature. These constraints can be
categorised into different groups based on semantics, properties and the nature
of the data source [68]. Semantic-based constraints depend on the interest of the
application domain. For example, item constraints can be used to extract patterns
containing a particular item or group of items. As an instance of an item
constraint, a drug manufacturing company may be interested in finding patterns
that contain particular drugs while mining its warehouse database. Property-based
constraints rely on the behaviour of a pattern when an item is added to or
removed from it. For example, if a pattern is found to be rare in a database, any
pattern that grows from this rare pattern, meaning any superpattern, is always a
rare pattern. In other words, adding an item to a rare pattern to make it a
superpattern does not change its rarity. This property is called an
anti-monotonic constraint. The type of data source may also influence the
selection of constraints. For example, some data sources are continuous while
others are episodic.

Many algorithms have been developed for constrained pattern mining. Srikant and
Agrawal [66] first introduced a constraint-based algorithm called Generalized
Sequential Patterns (GSP). This algorithm uses a time gap constraint, meaning the
time difference between two consecutive events, and a time span constraint, that
is, the time difference between the first event and the last event in a pattern.
Over time, other algorithms have developed or extended the GSP algorithm. Mannila
et al. [48] used the width of a time window as a constraint to find frequent
episodic patterns. This algorithm finds the patterns whose events are within the
time window constraint. For example, consider a sequential pattern
〈{a}, {b, c}, {d}〉 that contains three events {a}, {b, c} and {d}. If these
three events occur within a set time window constraint of 5 minutes, the pattern
is called a time window constrained pattern. In other words, all these events
happen inside the 5-minute time period. Like Mannila et al.'s [48] time window
method, our proposed method uses a time span constraint. However, we use the time
span constraint to find rare episodic patterns, unlike Mannila et al. [48], who
find frequent episodic patterns.

Chen et al. [73] apply the time interval between events as a constraint while
discovering sequential patterns. That is, if the time interval between the
consecutive events of a pattern satisfies the set time interval constraints, the
pattern is considered significant and hence is discovered from the database. Chen
and Hu [71] introduced two constraints, recency and compactness, in sequential
pattern mining. The recency constraint discovers the most recent patterns,
because the behaviour of a system may change over time. The compactness
constraint, on the other hand, finds patterns within a defined time span, which
is similar to the idea used by Mannila et al. [48]. Another constraint-based
algorithm, Sequential Pattern mIning with Regular expressIons consTraints
(SPIRIT), uses regular expressions as constraints [74]. Later, following the
SPIRIT idea, Antunes and Oliveira [75] developed an algorithm to infer
association rules using context-free grammars as constraints. Algorithms usually
integrate constraints into the mining process. In other words, constraints can be
directly associated with the actual pattern search process [76]. An example is
Zaki's [77] constraint-based cSPADE algorithm, in which constraints such as
restrictions on the length and width of a pattern are integrated into the mining
process.
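
The 5-minute time window example given earlier in this section can be illustrated
with a small check on event timestamps. The function within_time_window and the
timestamps below are hypothetical and are not taken from Mannila et al. [48] or
from the thesis data; the sketch only shows the idea of bounding the time between
the first and last event of a pattern occurrence.

```python
from datetime import datetime, timedelta


def within_time_window(event_times, window):
    """True if the time between the earliest and latest event is at most `window`."""
    return max(event_times) - min(event_times) <= window


# Hypothetical timestamps for one occurrence of the events {a}, {b, c} and {d}.
occurrence = [
    datetime(2019, 1, 1, 10, 0, 0),   # {a}
    datetime(2019, 1, 1, 10, 2, 30),  # {b, c}
    datetime(2019, 1, 1, 10, 4, 0),   # {d}
]

print(within_time_window(occurrence, timedelta(minutes=5)))  # True
```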

In a similar approach, we use a pattern size constraint to reduce the number of comparisons made while checking whether a candidate sequence pattern (sub-pattern) exists in an episodic sequence. If the candidate sequence pattern is larger than the episodic sequence, the candidate cannot be found in it. Hence, the search can be skipped, which reduces the number of comparisons. The pattern size constraint therefore improves the time efficiency of the proposed rare sequential pattern mining algorithm. We also use another constraint, called the pattern existence constraint, which is unique to our proposed rare sequential pattern mining method and further reduces the time spent searching for a candidate sequence pattern in an episodic sequence. When the frequency of a candidate sequence equals the maximum support threshold value, there is no need to search further for the candidate sequence in the sequence database. This is because the candidate sequence frequency cannot exceed the maximum support threshold value. Therefore, unwanted scanning is avoided, which helps to reduce the searching time.
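The two constraints described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of this thesis: sequences are assumed to be Python lists of sets (one set per event), the size of a pattern is assumed to be its number of events, and the names `is_subsequence` and `bounded_support` are hypothetical.

```python
def is_subsequence(candidate, sequence):
    """True if the candidate's events occur in order inside the sequence.

    Each candidate event must be contained in a later event of the sequence.
    """
    pos = 0
    for itemset in sequence:
        if pos < len(candidate) and candidate[pos] <= itemset:
            pos += 1
    return pos == len(candidate)


def bounded_support(candidate, database, max_support):
    """Count occurrences of the candidate, stopping at max_support."""
    count = 0
    for sequence in database:
        # Pattern size constraint: a candidate with more events than the
        # episodic sequence cannot be contained in it, so skip the search.
        if len(candidate) > len(sequence):
            continue
        if is_subsequence(candidate, sequence):
            count += 1
            # Pattern existence constraint: once the count reaches the
            # maximum support threshold, further scanning is unnecessary.
            if count == max_support:
                break
    return count


# Hypothetical episodic sequence database
database = [
    [{"a"}, {"b"}, {"c"}],
    [{"a"}, {"c"}],
    [{"b"}],
    [{"a"}, {"b"}],
    [{"a"}, {"d"}, {"b"}],
]
candidate = [{"a"}, {"b"}]

stopped_early = bounded_support(candidate, database, max_support=2)
full_count = bounded_support(candidate, database, max_support=10)
```

With a threshold of 2 the scan stops after the fourth sequence, whereas an unreachable threshold forces a full scan that finds three occurrences.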

2.8 Association Rule Mining

Association rule mining is a rule-based machine learning method that discovers interesting relations such as correlations, associations or causal structures between data in large databases [78]. The method is used to find strong rules involving two sets of items, indicating that if one itemset occurs then the other itemset also occurs in the dataset. Agrawal et al. [79] first applied association analysis to point-of-sale data, known as market basket analysis, to discover the purchasing behaviour of customers. An example of a market basket transaction database is given in Table 2.3. For example, the association rule {Bread, Milk} ⇒ {Diapers} can be extracted from the database shown in Table 2.3. This rule suggests that there exists a strong association between the itemset {Bread, Milk} and the itemset {Diapers}. In other words, the rule indicates that many customers who bought Bread and Milk together also bought Diapers in the same transaction. This information helps to make marketing decisions such as promotional pricing and inventory management, for example placing these items close to each other on the shelf. Association rules have many applications apart from market basket analysis, for example bioinformatics, web usage mining, intrusion detection and medical diagnosis [80].

Table 2.3: A market basket transaction database.

Transaction ID    Items Purchased
TID1              {Bread, Milk, Diapers}
TID2              {Bread, Diapers, Beer, Eggs}
TID3              {Milk, Diapers, Bread, Cola}
TID4              {Diapers, Beer, Bread, Milk}
TID5              {Bread, Milk, Diapers, Cola}

An association rule is denoted by the implication expression X ⇒ Y. Here, X is called the antecedent and Y is called the consequent of the rule. The antecedent X and consequent Y are disjoint itemsets, meaning X ∩ Y = ∅. The strength of an association rule between the two itemsets can be measured with different metrics, such as support, confidence, lift and conviction. These metrics determine how interesting a rule is among the set of possible rules. The most popular metrics for identifying interesting rules are support and confidence. Using these metrics we can focus on the interesting rules and disregard or eliminate uninteresting rules.

The Support indicates how frequently an itemset appears in a database. The support can be formally defined as:

$$ s(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{N} $$

Let X be {Bread, Milk} and Y be {Diapers}, and let N be the number of transactions, which is 5 for the database in Table 2.3. The support count of {Bread, Milk, Diapers} is 4, since this itemset appears in the four transactions TID1, TID3, TID4 and TID5. Therefore, the support of this rule is 4/5 = 0.8. In other words, this rule holds in 80% of all transactions.

The Confidence determines how often a rule has been found to be true in a database. In other words, the confidence indicates how often the itemset Y appears in those transactions of a database in which the itemset X also appears. The confidence can be formally defined as:

$$ c(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)} $$

The confidence of the rule {Bread, Milk} ⇒ {Diapers} can be calculated by dividing the support count of the itemset {Bread, Milk, Diapers}, which is 4, by the support count of the itemset {Bread, Milk}, which is also 4 because this itemset appears in the four transactions TID1, TID3, TID4 and TID5 of the database in Table 2.3. Therefore, the confidence of the rule is 4/4 = 1.0. That is, in 100% of the transactions in which a customer bought Bread and Milk, the customer also bought Diapers.
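The support and confidence calculations above can be reproduced with a short Python sketch over the transactions of Table 2.3. The function names `support_count`, `rule_support` and `rule_confidence` are hypothetical and serve only to illustrate the two definitions.

```python
# Transactions of Table 2.3, keyed by transaction ID
transactions = {
    "TID1": {"Bread", "Milk", "Diapers"},
    "TID2": {"Bread", "Diapers", "Beer", "Eggs"},
    "TID3": {"Milk", "Diapers", "Bread", "Cola"},
    "TID4": {"Diapers", "Beer", "Bread", "Milk"},
    "TID5": {"Bread", "Milk", "Diapers", "Cola"},
}


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


def rule_support(x, y, transactions):
    """s(X => Y): support count of X union Y divided by N."""
    return support_count(x | y, transactions) / len(transactions)


def rule_confidence(x, y, transactions):
    """c(X => Y): support count of X union Y divided by that of X."""
    return support_count(x | y, transactions) / support_count(x, transactions)


x = {"Bread", "Milk"}
y = {"Diapers"}
s = rule_support(x, y, transactions)     # 4 / 5
c = rule_confidence(x, y, transactions)  # 4 / 4
```

The values of `s` and `c` match the 0.8 and 1.0 derived by hand in the text.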

Pattern mining is the first step of association rule mining. An association rule finds a correlation among the items in a pattern, which is not possible with pattern mining alone. This is because in pattern mining the items are listed as a single set, whereas in an association rule the items are split into two distinct correlated itemsets. For example, in pattern mining {Bread, Milk, Diapers} is a frequent pattern in Table 2.3 when the minimum support threshold is 2. It is not possible to say from this pattern whether any item tends to co-occur with other items in a single transaction. On the other hand, from the association rule {Bread, Milk} ⇒ {Diapers}, which is generated from that frequent pattern, it is possible to say that if Bread and Milk are purchased in a single transaction, it is likely that Diapers will also be purchased in the same transaction.

The association rule can be categorised into two groups based on the order of

the antecedent and the consequent of a rule. These two categories of association

rules are (i) the Itemset Association Rule and (ii) the Sequential Association

Rule. When the order between the antecedent and the consequent of a rule is

not considered, it can be considered an itemset association rule. For example,

{Bread, Milk} ⇒ {Diapers} in the previous discussion is an itemset association

rule. This is because the rule does not say anything about the order between the

antecedent itemset {Bread, Milk} and the consequent itemset {Diapers}. The

rule does not indicate which itemset was purchased before the other itemset.

Rather the rule says that the itemset comprising both antecedent and consequent

{Bread, Milk, Diapers} is purchased together in a single transaction.

On the other hand, sequential association rule mining considers the order between the antecedent itemset and the consequent itemset. The rule X ⇒ Y is considered a sequential association rule if and only if the antecedent is followed by the consequent. However, the order of the items within the antecedent and consequent itemsets is not considered. For example, the sequential association rule {a, b} ⇒ {c, d} indicates that the itemset {a, b} occurs before the itemset {c, d} in a database, that is, the antecedent is followed by the consequent. The order of the items inside the antecedent {a, b} and inside the consequent {c, d} does not matter. The interestingness of a sequential association rule can be measured with the same metrics, support and confidence, that are used in itemset association rule mining.
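The ordering requirement of a sequential association rule such as {a, b} ⇒ {c, d} can be illustrated with the following minimal sketch. It assumes, for illustration only, that a sequence is a list of single events; the function name `rule_occurs` is hypothetical.

```python
def rule_occurs(antecedent, consequent, sequence):
    """True if all antecedent items occur before all consequent items.

    The order of items inside the antecedent and inside the consequent
    is ignored; only the order between the two itemsets matters.
    """
    seen = set()
    for i, event in enumerate(sequence):
        seen.add(event)
        if antecedent <= seen:
            # Earliest point where the antecedent is complete: every
            # consequent item must appear strictly after this point.
            return consequent <= set(sequence[i + 1:])
    return False


antecedent = {"a", "b"}
consequent = {"c", "d"}

follows = rule_occurs(antecedent, consequent, ["b", "a", "d", "c"])
interleaved = rule_occurs(antecedent, consequent, ["a", "c", "b", "d"])
reversed_order = rule_occurs(antecedent, consequent, ["c", "d", "a", "b"])
```

Only the first sequence supports the rule: the items of {a, b} appear in any order, and both c and d follow them.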

Sequential association rules can be grouped into two categories [81]. In the first category, the order of events in a sequence is maintained both within the antecedent and within the consequent of a sequential rule. In addition, the order is maintained between the antecedent and the consequent. In the second category, the order of events inside the antecedent and the consequent is not maintained, although the order between the antecedent and the consequent still is. This thesis uses sequential association rule mining where the order of events is maintained both inside the antecedent and the consequent and between them. This is because the logs of the SCADA control system are sequential, with a timestamp attached to each event.

Sequential association rule mining can be used in different application domains, such as the mobile telecom industry. In the telecom industry, sequential association rules can be used to reduce the large number of alarms, which allows the network operator to select the most important alarms [82]. This allows the operator to take precautions in advance to reduce possible consequences or system disruption. Another application of sequential association rule mining is predicting future energy demand [83]. The authors in [76] establish a correlation between consumers' household activity and electricity consumption to find the future electricity demand pattern. In web mining, sequential association rules can be used to predict future HTTP requests to a web site [84]. Yong et al. [84] use web page usage patterns to mine association rules under sequence and temporal constraints. Although association rule mining can be used to predict a future event based on the current trend of a pattern, no research has been done to predict a future event or sequence of events that is part of an anomalous sequence.


2.9 Existing Anomaly Detection in SCADA System

Research has already been done to find anomalous behavior in SCADA systems from diverse perspectives. The existing works can be analysed according to the following categories: (i) Log based anomaly detection techniques, (ii) Protocol based anomaly detection techniques, (iii) Network traffic based anomaly detection techniques, (iv) Data mining based anomaly detection techniques and (v) Rare pattern mining based anomaly detection techniques.

(i) Log based anomaly detection techniques: Naedele et al. [85] propose aggregating log resources from a distributed process control environment, analysing them, and presenting the security status in visual form to a non-IT operator who monitors the process and detects abnormal conditions using domain expertise. Naedele et al. also combine human experience with IDS techniques to reduce the false positive alarm rate, which is normally very high with rule based as well as anomaly based techniques. Balducelli et al. [86] attempted to find abnormal behavior using a case based reasoning method. They compared the sequence of SCADA log events with previously defined normal behavior. However, they did not work with real SCADA log files; rather, they experimented in a simulated test-bed environment. Infrastructure appliances can record the operational activities of their systems to a system log file in a defined format. Moreover, any successful or attempted unauthorized access or intrusion into the system can also be recorded in log files, which can be analyzed for intrusion or fault detection by security experts or network administrators. Therefore, event logs could be a good data source to preprocess and analyze for detecting attacks or intrusions that have happened in the system. Vaarandi [87] uses event logs from network devices such as routers to identify anomalies. The author uses his proposed clustering algorithm, Simple Log File Clustering Tool (SLCT), to separate outliers from the normal system profile. The tool checks each line of the event log; if the line is similar to the event logs of the normal system profile, it is placed in a normal group. If the line does not fall into the normal system profile group, it is considered an outlier and placed in an outlier group. Felix et al. [88] state that different systems maintain logs in their own formats, which makes it hard to understand and process logs from outside the domain. However, Garitano et al. [15] state that SCADA systems have a number of unique features, such as deterministic communication and limited and recurrent activities. Therefore, modelling anomaly detection for SCADA systems is comparatively easier than for traditional computer networks. In addition, Hadžiosmanović et al. [14] mention that, compared to logs used in other domains, SCADA logs are in a good format to analyze.

(ii) Protocol based anomaly detection techniques: It is difficult for network administrators to manually analyze the volumes of log entries in SCADA systems. Therefore, a data mining approach is needed to analyse large volumes of data, find hidden, useful patterns and detect anomalous events or attacks. In the past, anomaly detection systems primarily applied two techniques, signature based and anomaly based methods, for general enterprise networks. Later, these detection techniques were adopted in or migrated to control system networks such as SCADA. Anomaly based detection techniques are not usually used in enterprise networks because of their rapidly changing behavior [89]. However, SCADA networks or control networks usually do not change rapidly in terms of topology, protocols, actions and communication patterns [17] [90] [91] [92]. Therefore, we argue that anomaly detection techniques would be able to identify anomalies effectively in control system networks. This technique usually learns the normal behavior of a system and raises an alert if any deviation from the regular system behavior is found [92].

(iii) Network traffic based anomaly detection techniques: There has been much research on anomaly detection from diverse perspectives. Some approaches operate at the communication protocol level [90] [91] and analyse SCADA logs for threat identification in SCADA processes. In most cases, signature based techniques are used for anomaly detection due to their high detection rate and low false alarm rate. Some tools are available for free, such as Snort, Sensors and Net-Rangers, and some are commercial, such as Real-Secure [93] [26]. These tools collect network traffic data and match it against predefined patterns in a database to identify any deviation and thus the occurrence of an anomaly in the system. They usually use signature based techniques to detect suspicious activities and refer them to analysts for further investigation. This type of software requires periodic updates of its database and hence works well in small network environments [94]. However, attackers constantly update their techniques to counter defensive security measures, so this technique cannot cope with a large number of attacks and raises large volumes of alerts for security experts to investigate further.

(iv) Data mining based anomaly detection techniques: Data mining can be used to find hidden patterns (regularities and irregularities) in the big data generated by ICS such as SCADA, which can help detect anomalies in a system. Currently, data mining methods cannot identify anomalies from constantly changing datasets [94]. However, as SCADA log data is almost steady, we are optimistic about detecting anomalies using data mining techniques. Manganaris et al. [26] show that the absence of a frequent event or set of events can be considered an anomaly. Clifton et al. [27] applied data mining techniques to identify the normal behavior of a system based on the frequent occurrence of alarm events, and later filtered out lists of suspicious events. Barbara et al. [95] analyse system and user behavior using association rules mined from network traffic data to train a model. They treat any deviation in the association rules as abnormal behavior.

(v) Rare pattern mining based anomaly detection techniques: Although there has been some work on infrequent itemset mining, it does not address the order of the itemsets or events. Until now, there has been no research in the rare pattern mining literature that considers events in sequential order to detect anomalies. Our research focuses on maintaining the order of the events because the sequence order helps to find causal relationships among the events. A break in the sequence order represents anomalous events or attacks. Research in the rare pattern mining field started in the late 1990s. This area has recently attracted a lot of attention from researchers due to the increasing demand for its application to anomaly detection in network security, medicine, genetics and molecular biology [96].

Saha et al. [97] mention that the basic strategy for rare pattern mining is to identify all the frequent patterns in a transaction database using a user defined threshold and then prune these patterns from the database. As a result, the remaining patterns, which fall below the support value, are considered infrequent. Szathmary et al. [98] discovered rare itemsets by identifying minimal rare itemset generators. Their approach is to identify individually frequent items; if a combination of these frequent items is infrequent, the combination is considered a rare itemset. For example, vegetarian {veg} and cardiovascular disease {cvd} are individually frequent items, but their combination {veg, cvd} is infrequent, so {veg, cvd} is a rare itemset. However, their co-occurrence does not carry significant meaning; in other words, it is considered unlikely for vegetarians to have cvd.

We are inspired by the motivation of the aforementioned work. Although that work addressed rare itemset mining, we have ventured into finding rare or infrequent sequential patterns where the order of itemsets or events is preserved. The order of the events helps to find causal relationships among the events, which is missing in Szathmary et al.'s work. For example, Fan-Failure and Device-Down are two events with a causal relationship such that Fan-Failure leads to Device-Down [99]. In other words, the occurrence of the event Fan-Failure causes the second event Device-Down to happen. Therefore, the two events in the order {Fan-Failure, Device-Down} carry meaningful information. However, if these two events occur in the order {Device-Down, Fan-Failure}, they do not carry meaningful information about which event causes the other.

Therefore, infrequent or rare sequential patterns need to be further analyzed to discover anomalous events, which are also rare. It is worth mentioning that if a rare event is performed several times, it becomes a frequent event and would be considered a normal or regular event of the system. In that case, this technique might not work properly. Usually, frequent sequential patterns are those sequences that satisfy the user defined minimum support threshold value. The remaining sequences, which fall below the threshold value, are considered infrequent sequential patterns and are usually ignored because they do not represent the regular or normal behavior of the system. The current literature focuses only on frequent patterns and ignores infrequent patterns. Our research will use these infrequent sequential patterns to detect anomalies. The closest match to this research is Szathmary et al.'s work [98]. However, Szathmary et al. did not consider the sequence order of events, and as a result there is no correlation among the events; this will be addressed in this research.

2.10 Summary and Research Gaps

This chapter has reviewed literature concerning anomaly detection techniques in critical infrastructures with diverse data sources. The data used is mainly communication protocols, network traffic and, in a few cases, log files. The methods used to detect anomalies in this data range from case based reasoning and human expertise combined with traditional IT analysis to data mining techniques. Most of the research was conducted using signature based techniques, which can only identify previously known attack patterns. However, attackers constantly change their attack techniques or patterns, which these methods fail to identify. Therefore, our research will address these shortcomings and develop a behavior based anomaly detection model that will not only identify known attacks but also detect new or unknown attacks. In addition, the developed model will generate an early prediction of possible anomalies in the system.

This chapter has also reviewed the basic building blocks of SCADA architecture and the control device logs that record SCADA process activities. Based on this review, it has been found that SCADA control systems play a critical role in monitoring and controlling critical infrastructures. However, SCADA systems have been the target of cyber attacks that aim to disrupt the process activities of the control system. This disruption may cause environmental as well as financial losses, which is a great concern for communities that use SCADA.

Therefore, there is a need to detect anomalies and intrusions in SCADA control systems. It has been found that data mining in general, and pattern mining in particular, can play a significant role. This is because the large volume of logs generated by SCADA control systems can be effectively analysed by data mining algorithms.


Three research gaps have been identified from the literature. A brief description of each research gap follows.

1st Research Gap: There exists no anomaly detection method that maintains the order of events and can identify abnormal behaviour from SCADA control logs. In a SCADA control system, events occur in a sequential manner. The normal or regular ordered sequence of events accomplishes a complete process, while an irregularly ordered sequence of events disrupts the normal process. For example, the following ordered sequence of events in a water tank control system accomplishes a complete process.

〈{Close_Valve}, {Pump_On}, {Check_Water_Max_Threshold_Value}, {Pump_Off}, {Open_Valve}〉

This ordered sequence of events fills a water tank to its maximum threshold capacity. Once the threshold is reached, the water is released from the tank. When the water level touches the lower threshold value, the above ordered sequence starts again to fill the water tank. This process continues for a defined time period. However, if the events in the above sequence are performed in the following different order, the water tank overflows and floods the system.

〈{Open_Valve}, {Check_Water_Max_Threshold_Value}, {Pump_Off}, {Close_Valve}, {Pump_On}〉

This is because the changed order of events deviates from the regular process of the water tank control system.

Although there exists a single work by Hadžiosmanović et al. [17] that finds anomalies in water treatment control system logs, that work used itemset pattern mining to identify abnormal or unusual usage of the system by a user. Hadžiosmanović et al.'s work finds a single unusual event with the help of domain expert knowledge. However, since SCADA control system events occur in a sequential manner, the order of the events is important in distinguishing the normal behaviour from the abnormal behaviour of the SCADA control system. For example, if the events a, b, and c happen in the SCADA system in the order a, then b, then c, the sequence is called the regular profile of the system. However, if the order of events changes, such as event a followed by c and then by b, the sequence is called abnormal and can be considered an anomaly in the system.
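The difference between an itemset view and a sequence view of the events a, b and c can be shown with a small sketch. The function name `classify` and its three labels are hypothetical and serve only to illustrate why an order-agnostic comparison cannot separate the regular profile from the reordered sequence.

```python
def classify(observed, regular_profile):
    """Compare an observed event sequence with the regular profile.

    Labels are illustrative only:
    - "regular": same events in the same order
    - "order anomaly": same events, different order
    - "different events": the sets of events differ
    """
    if list(observed) == list(regular_profile):
        return "regular"
    if set(observed) == set(regular_profile):
        return "order anomaly"
    return "different events"


regular_profile = ["a", "b", "c"]

# An itemset comparison treats both sequences as identical ...
same_items = set(["a", "c", "b"]) == set(regular_profile)

# ... whereas a sequence comparison separates them.
label_regular = classify(["a", "b", "c"], regular_profile)
label_reordered = classify(["a", "c", "b"], regular_profile)
```

The reordered sequence contains exactly the same items as the regular profile, so only the order-aware comparison flags it.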

In Chapter 3 of this thesis we find anomalies in SCADA control system logs. We use a rare sequential pattern mining approach to detect anomalies. The reason for using a rare sequential pattern mining method is to verify the notion that anomalies happen rarely in a system. To find rare sequences of events, this research developed an algorithm that can generate all rare sequential patterns in a sequence database. Chapter 3 also generates equivalence classes that separate all rare patterns into different groups, each comprising the minimal rare sequential pattern, the maximal rare sequential pattern and all the patterns in between. Depending on the domain interest, sometimes the minimal pattern helps to identify the anomalies, and sometimes the maximal pattern better helps to detect the root cause of anomalies in the system.

2nd Research Gap: Analysing a rare sequential pattern mining algorithm with the integration of constraints. There is a need to investigate constrained rare sequential pattern mining. The first reason is to find rare patterns within an episodic time period or session period. If a pattern originates from a single session, the events in the pattern are capable of carrying out a complete task. Otherwise, if the pattern is composed of events coming from consecutive sequences that extend beyond the session time, the pattern may lose its ability to accomplish a task.

The second reason is to improve the efficiency of the rare sequential pattern mining algorithm by reducing the number of unimportant rare patterns and minimizing the computational time of the algorithm while generating the desired patterns. These can be achieved by integrating constraints that filter out unnecessary patterns, which helps to reduce the database search space. The reduced search space in turn leads to less computational time. The third reason is to reduce the false positive rate, which can be achieved by identifying anomalous patterns from a small number of suspicious rare patterns while maintaining the accuracy of the algorithm.


In Chapter 4, this research finds anomalous patterns that belong to an episodic session time period. The events of a pattern are not spread across consecutive sequences. If the events of a pattern happen within a session time period, the pattern is significant in that it has the capacity to do harm to a system. In this chapter, we also reduce the complexity of the rare sequential pattern mining algorithm developed in Chapter 3. The complexity is reduced by improving the efficiency of the algorithm, both in the number of rare sequential patterns generated and in the computational time the algorithm takes to generate them. The efficiency is improved by integrating constraints into the algorithm. There is a trade-off between efficiency and accuracy. However, the accuracy of the algorithm for anomaly detection does not decrease, owing to careful use of the constraints with the help of domain expert knowledge.

3rd Research Gap: The absence of an anomaly prediction method that can raise alerts about possible incoming anomalies on a live SCADA system. The connection of modern SCADA systems to the internet exposes them to cyber-attacks. One method of detecting a cyber-attack is to analyse static off-line logs to understand the nature of the attack once it has occurred on the system. Although this method can identify the attack, it cannot protect the system from the attack, because it is unable to provide information about a possible attack before the attack happens. As a result, no precautions can be taken to avoid the attack. Therefore, there is a need for an anomaly prediction method that can raise an alert for a possible anomaly on a live SCADA system. In Chapter 5 of this thesis, the research uses association rules generated from rare sequential patterns to predict incoming or ongoing anomalies or attacks in the system by analysing streaming SCADA logs. If the antecedent of an association rule is found in the incoming streaming logs, it is predicted that the consequent of that rule is likely to happen in the near future. The antecedent, comprising a sequence of events, is considered the precursor, meaning it indicates that the consequent of the rule is likely to follow. If the consequent, which is also an event or sequence of events, occurs within a defined time period or session time in the log stream, the prediction is confirmed.
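The prediction idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method developed in Chapter 5: the stream is assumed to be a list of (timestamp in seconds, event) pairs, the antecedent and consequent are assumed to be ordered lists of events, the session limit is applied only to the consequent, and the names `predict_from_stream`, `e1`, `e2`, `e3` and `x` are hypothetical.

```python
def predict_from_stream(stream, antecedent, consequent, session):
    """Scan a log stream for a rule's antecedent, then its consequent.

    Returns (alert_time, confirmed):
    - alert_time: timestamp at which the antecedent was completed, or
      None if the antecedent never occurred in order.
    - confirmed: True if the consequent followed within `session`
      seconds of the alert.
    """
    matched = 0
    alert_time = None
    for timestamp, event in stream:
        if alert_time is None:
            # Phase 1: look for the antecedent events in order.
            if event == antecedent[matched]:
                matched += 1
                if matched == len(antecedent):
                    alert_time = timestamp  # raise the alert here
                    matched = 0
        else:
            # Phase 2: wait for the consequent within the session time.
            if timestamp - alert_time > session:
                break
            if event == consequent[matched]:
                matched += 1
                if matched == len(consequent):
                    return alert_time, True
    return alert_time, False


# Hypothetical streaming log: (seconds, event)
stream = [(0, "e1"), (5, "x"), (10, "e2"), (20, "e3")]

confirmed = predict_from_stream(stream, ["e1", "e2"], ["e3"], session=30)
expired = predict_from_stream(stream, ["e1", "e2"], ["e3"], session=5)
no_alert = predict_from_stream(stream, ["e2", "e1"], ["e3"], session=30)
```

In the first call the alert is raised at second 10 and confirmed at second 20; with a 5-second session the same alert is raised but the consequent arrives too late.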


The following chapter, Chapter 3, presents the first contribution of this thesis, which addresses the first research gap: there are no anomaly detection methods for SCADA control logs that use rare sequential pattern mining.

Chapter 3

A Rare Sequential Pattern Mining

Approach for Anomaly Detection

3.1 Introduction

Pattern mining is a branch of data mining research focused on discovering useful underlying information in a large database using different methods [100]. Among these techniques, some deal with itemset pattern mining and others are used for mining sequential patterns. In both categories, frequent pattern mining is a widely practised research domain because it can discover the regular or expected behaviour in data. However, to reduce computation requirements to achievable levels, this technique ignores a large and sometimes very interesting segment of the database, namely the infrequent or rare behavioural patterns, which could be significant.

It is believed that in some circumstances rarity in data gives useful, interesting

and reliable information to discover the unexpected or anomalous behaviour,

which goes against a common assumption of the data mining domain [98]. An

example where anomalous behaviour is critically important is in industrial control

SCADA (supervisory control and data acquisition) networks, which manifest a

regular and expected system behaviour for its limited and repetitive actions. So,

any irregular or unexpected behaviour, which is very rare in a SCADA system,

deserves further investigation.


In an industrial domain, a process is completed with a specific ordered sequence of actions. For example, consider a SCADA water tank system that consists of two water tanks and a pump that moves water from a lower tank into an upper tank. Gravity allows water in the upper tank to move back into the lower tank. We argue that a particular order of events can lead to an undesired result. In regular operation, if the upper tank reservoir valve is closed and the pump is on, the pump fills the upper tank. When the water level reaches the 40% level of the tank, a sensor triggers the pump off and opens the valve. Water then drains out to the lower tank. When the water level touches a certain low level mark in the upper tank, the valve closes, the pump is triggered on and the upper tank starts filling again. The following ordered sequence of events is a regular system profile for filling a water tank reservoir:

〈{Close_Valve}, {Pump_On}, {Check_Water_Level_40%}, {Pump_Off}, {Open_Valve}〉

However, if these events are performed in a different order, the system can flood. Assume that the water level in the upper tank reservoir is above the 40% level mark of its capacity and the valve is triggered to open. The water from the upper tank reservoir then starts draining into the lower tank reservoir. When the water level of the upper tank reservoir touches the 40% level mark, the sensor triggers the pump off. Then the valve closes and the water pump is triggered on. This ordered sequence of events is given below:

〈{Open_Valve}, {Check_Water_Level_40%}, {Pump_Off}, {Close_Valve}, {Pump_On}〉

If the system runs with the above order of events, there are no further checks before the water level reaches its upper threshold capacity. As a result, the upper tank reservoir could overflow and cause the system to flood.

The fundamental difference between itemset pattern mining and sequence pattern mining is that in itemset pattern mining the order of the events is not considered, whereas in sequence pattern mining the order of the events is maintained and considered important. The importance of the order is shown in the


above water tank reservoir example, where an incorrect order of execution of events can cause the system to malfunction. At the logical level, itemset and sequential pattern mining are the same in that candidate generator patterns are generated level-wise by adding an item to an existing pattern. However, there is a methodical difference in how items are added to the pattern. In itemset mining, an item is added to a set of items, whereas in sequential pattern mining the item can be added at different positions of the pattern. For example, if 〈{a}, {b}〉 is a pattern of size-2 and we need to generate a candidate pattern of size-3 by adding the item {c}, then in itemset mining we get only the candidate pattern {a, b, c}. However, in sequential pattern mining we get three different candidate sequence patterns, 〈{c}, {a}, {b}〉, 〈{a}, {c}, {b}〉, and 〈{a}, {b}, {c}〉. Therefore, in sequence pattern mining the integrity of the original pattern is preserved: the occurrence order of the events in a pattern is not changed after adding new items. In itemset pattern mining there is no such integrity because an itemset has no order.
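The contrast between the two kinds of candidate generation can be sketched in a few lines of Python. This is only an illustration: the function names `extend_itemset` and `extend_sequence` are hypothetical, and a sequence is assumed to be represented as a tuple of frozensets.

```python
def extend_itemset(itemset, item):
    """Itemset mining: adding an item yields exactly one candidate set."""
    return itemset | {item}


def extend_sequence(pattern, event):
    """Sequence mining: insert the event at every position of the pattern.

    The relative order of the original events is never changed.
    """
    return [pattern[:i] + (event,) + pattern[i:] for i in range(len(pattern) + 1)]


a, b, c = frozenset("a"), frozenset("b"), frozenset("c")

itemset_candidate = extend_itemset(frozenset({"a", "b"}), "c")
sequence_candidates = extend_sequence((a, b), c)

print(sorted(itemset_candidate))   # one unordered candidate
for cand in sequence_candidates:   # three ordered candidates
    print([sorted(e) for e in cand])
```

A pattern with k events always yields k + 1 candidates under this insertion scheme, which is why the candidate space of sequence mining grows faster than that of itemset mining.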

Therefore, we argue that considering only rare itemset patterns, which ignore event order, cannot be effective in identifying important sequential anomalies. Also, as some adversaries may not have an in-depth understanding of a targeted system, they may try actions that are incompatible with the system's predefined action sequences, which would result in rare and abnormal events. Furthermore, the order of the events may help to find causal relationships. For example, Fan-Failure and Device-Down are two events with a causal relationship: Fan-Failure leads to Device-Down [99]. That is, the fan malfunctions first and then the device stops working. The opposite event order, Device-Down followed by Fan-Failure, does not show this causal relationship.

It is often observed that shorter patterns are frequent, while longer patterns are by nature likely to be rare or infrequent, and their combinations can be even rarer. If a combination of two frequent single items is infrequent, it is called a rare pattern. These patterns could be useful in different application domains for finding anomalies. The number of rare patterns can be large, and it depends on the user-defined support threshold. The support value measures the frequency of a sequence in a sequence database. If the support threshold is high, the number of generated rare patterns is high, but if the support threshold is low, a small number of rare patterns is generated. However, for sequential patterns, the number of rare patterns is still large because of the different arrangements of the event order in a pattern.

Therefore, identifying useful rare patterns is challenging with sequential pattern mining. As there is an increasing demand for rare patterns in anomaly detection in network security, medicine, genetics and molecular biology [101], we are motivated to find anomalies with rare sequential patterns. The present work builds upon the authors' previous work [102]. That work discovered only the minimal rare sequential generator patterns, which are considered the seeds of all rare patterns, and showed that the order of the events in a pattern is important in a system where the events occur in a sequential manner.

There are three aspects to the contributions of this work. Firstly, this work is the first approach to find rare sequential patterns. Secondly, we present a structured generator-based method to generate all rare sequential patterns. Thirdly, different application domains may have different interests in the sizes of the rare patterns. Some domains may prefer the smallest patterns over the largest patterns to find anomalous behaviour of a system, while others may consider the largest patterns better for reaching a conclusion about anomalies in a system. To cater to these demands, this method separates all rare sequential patterns into different groups having the same frequency. In each group, the smallest pattern is called the minimal pattern and the largest pattern is called the maximal pattern. Sometimes there is more than one minimal or maximal pattern in a group. The minimal patterns help to find the source or seed of attacks, while the maximal patterns help to find the size or level of the disruption. Finally, this method is applied to real SCADA control system logs to validate the usefulness of the approach in finding anomalies.

The remaining sections of this paper are organized as follows. Section 3.2 gives a discussion of definitions related to our work. Section 3.3 explains our proposed novel method for finding minimal and maximal rare patterns using sequential pattern mining. Section 3.4 gives the details of experimental procedures and results. Section 3.5 provides analysis and discussion of the findings. Section 3.6 discusses the related research, and finally, Section 3.7 draws conclusions and future work.


3.2 Definitions

We begin by presenting the definitions of some concepts required to formally describe our approach and algorithms. Some of the concepts are commonly used in related works such as the one in [103].

Definition 1 (Sequence): Let I = {i1, i2, i3, ..., in} be a set of items. An itemset X (also called an event) is an unordered set of distinct items, that is, X ⊆ I. A sequence S is an ordered list of itemsets, i.e., S = 〈X1, X2, ..., Xk〉, where Xj ⊆ I and 1 ≤ j ≤ k. Note that while an item can appear only once inside an event, an item may appear in multiple events within a sequence.

Definition 2 (Sequence database): A sequence database SDB comprises a set of sequences, SDB = {S1, S2, ..., Sp}, where each Sj is a sequence. Each sequence of the database SDB is identified by a unique sequence identifier SID.

Let S be a sequence in the database SDB, and let Events(S) be the set of events that occur in S. The set of all events that occur in the database SDB is defined as Events(SDB) = ∪_{S∈SDB} Events(S).

Definition 3 (Sequence containment): A sequence Sa = 〈A1, A2, ..., An〉 is said to be contained in a sequence Sb = 〈B1, B2, ..., Bm〉 if and only if there exist integers 1 ≤ i1 < i2 < ... < in ≤ m, n ≤ m, such that A1 ⊆ B_{i1}, A2 ⊆ B_{i2}, ..., An ⊆ B_{in}, and this is denoted as Sa ⊑ Sb. In this case Sa is a subsequence of Sb and Sb is a super sequence of Sa, where n ≤ m indicates that the number of events in a subsequence must be less than or equal to the number of events in a super sequence.

For example, 〈{b}, {c, e}〉 is a subsequence of 〈{b, d}, {c, e, f}, {h}〉, while 〈{a}, {b}〉 is not a subsequence of 〈{a, b}, {c, d}〉.
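Definition 3 can be checked mechanically. The following is a minimal sketch, assuming a sequence is a list of Python sets; the name `contains` is hypothetical. It scans the super sequence from left to right and matches each event of the candidate subsequence to the earliest later event that is a superset of it, which is sufficient for this kind of containment test.

```python
def contains(sub, seq):
    """Return True if `sub` is contained in `seq` in the sense of Definition 3."""
    pos = 0
    for event in sub:
        # Skip events of seq that do not include every item of `event`.
        while pos < len(seq) and not event <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False  # ran out of events in seq
        pos += 1          # the next event of sub must match strictly later
    return True


# The two examples given after Definition 3.
first = contains([{"b"}, {"c", "e"}], [{"b", "d"}, {"c", "e", "f"}, {"h"}])
second = contains([{"a"}, {"b"}], [{"a", "b"}, {"c", "d"}])
print(first, second)
```

The second example fails because {a} and {b} would have to match two different events in order, whereas the super sequence holds both items in a single event.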

Definition 4 (Size and Length of a sequence): The size |S| of a sequence is the number of events or itemsets in that sequence, while the length of a sequence is the total number of individual items (counting repetitions) in that sequence.

For example, S = 〈{a}, {b, c}, {b, d}, {e}〉 is a sequence with four events and hence the size of this sequence is 4, denoted as size-4, whereas the length of this sequence is 6 because the total number of items is 6.
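As a small illustration of Definition 4, and again assuming a sequence is a list of sets, size and length reduce to two one-line functions (the names `size` and `length` are chosen here for readability only).

```python
def size(seq):
    """Number of events (itemsets) in the sequence."""
    return len(seq)


def length(seq):
    """Total number of items, counting an item once per event it appears in."""
    return sum(len(event) for event in seq)


S = [{"a"}, {"b", "c"}, {"b", "d"}, {"e"}]
print(size(S), length(S))
```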


Table 3.1: A sequential database SDB.

Sequence ID    Sequence
SID1           〈{a}, {b, d}, {e}, {c}〉
SID2           〈{a}, {c}, {b}, {e}〉
SID3           〈{a, b}, {c}, {b}, {e}〉
SID4           〈{b}, {c, e}, {e}〉
SID5           〈{a}, {b, c}, {e}, {f}, {c}〉
SID6           〈{g}〉

Definition 5 (Support): The support of a sequence Sa in a sequential database SDB is the number of sequences S ∈ SDB such that Sa ⊑ S.

For example, the pattern 〈{a}, {b}, {e}〉 is found in four sequences (SID1, SID2, SID3, and SID5 shown in Table 3.1); therefore, the support of this pattern is 4 in absolute count or approximately 67% (4 out of 6) in relative count.
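The support count of Definition 5 can be reproduced on the database of Table 3.1. The sketch below is illustrative only: `seq`, `contains`, `occurrences` and `support` are hypothetical helper names, and each event is written as a string of single-letter items for brevity.

```python
def seq(*events):
    """Build a sequence from strings, e.g. seq("a", "bd") is <{a}, {b, d}>."""
    return tuple(frozenset(e) for e in events)


def contains(sub, s):
    """Sequence containment test of Definition 3 (greedy left-to-right match)."""
    pos = 0
    for event in sub:
        while pos < len(s) and not event <= s[pos]:
            pos += 1
        if pos == len(s):
            return False
        pos += 1
    return True


# Table 3.1
SDB = {
    "SID1": seq("a", "bd", "e", "c"),
    "SID2": seq("a", "c", "b", "e"),
    "SID3": seq("ab", "c", "b", "e"),
    "SID4": seq("b", "ce", "e"),
    "SID5": seq("a", "bc", "e", "f", "c"),
    "SID6": seq("g"),
}


def occurrences(pattern, sdb):
    """IDs of the sequences that contain the pattern."""
    return [sid for sid, s in sdb.items() if contains(pattern, s)]


def support(pattern, sdb):
    """Absolute support: number of sequences containing the pattern."""
    return len(occurrences(pattern, sdb))


abe = seq("a", "b", "e")
print(occurrences(abe, SDB), support(abe, SDB))
print(f"relative support: {support(abe, SDB) / len(SDB):.0%}")
```

With a threshold of 2, a pattern is rare when `support(pattern, SDB) < 2`, which is the test applied in Definition 6.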

Definition 6 (Frequent and Rare Sequential Pattern): A frequent sequential pattern is a sequence whose support is greater than or equal to the user-defined support threshold. A rare sequential pattern is a sequence whose support is less than the user-defined threshold.

For example, the sequential pattern 〈{a}, {b}, {e}〉 is frequent while the pattern 〈{a, b}, {c}〉 is rare in the database SDB in Table 3.1 when the user-provided support threshold value is 2. These frequent and rare sequential patterns can be represented by a lattice, which is a structure that describes the relationship among the sequential patterns in a sequential database. This is similar to the structure used by Pasquier et al. [104]. In both cases, the relationship is an ordered relationship between the superpatterns and subpatterns. However, the lattice used by Pasquier et al. [104] is built on a subset relationship while the lattice used in this paper is built on a subsequence relationship. For the sequential database in Table 3.1, a simplified partial lattice is given in Figure 3.1, which illustrates the relationship, shown in thin lines, among the sequential patterns in the database. The subsequence relationship among the patterns is a partial order. For example, the sequential pattern 〈{a}, {c}, {e}〉 is a subsequence of the sequential patterns 〈{a}, {c}, {e}, {f}〉 and 〈{a}, {c}, {f}, {e}〉, and their relationships are shown in Figure 3.1. However, there is no subsequence relationship between 〈{a}, {c}, {e}, {f}〉 and 〈{a}, {c}, {f}, {e}〉.


Figure 3.1: A partial lattice view of a sequential database.

Definition 7 (Closed and Generator Sequential Pattern): A sequential pattern sa is called a closed pattern if there is no other sequential pattern sb such that sa ⊑ sb and their supports are equal. A sequential pattern sa is called a generator if there is no other sequential pattern sb such that sb ⊑ sa and their supports are equal.

For example, the pattern 〈{a, b}, {c}, {e}〉 is a closed pattern because there are no super sequential patterns which have the same support. The pattern 〈{a, b}〉 is a generator pattern because no sub-sequential patterns are found which have the same support.

Definition 8 (Rare Sequential Generator Pattern): A sequential pattern is called a rare sequential generator pattern (RSG) if it is rare and all of its proper subsequences are frequent.

For example, sequence 〈{a, b}〉 is a rare sequential generator pattern (shown in a thick bordered shaded rectangle on the right side of the solid slanted line in Figure 3.2) because all of its proper subsequences, 〈{a}〉 and 〈{b}〉, are frequent when the minimum support threshold is 2.

Definition 9 (Zero Sequential Pattern and Minimal Zero Sequential Generator Pattern): A sequential pattern is called a zero sequential pattern if and only if it does not exist in the database, i.e., its support value is 0. A zero sequential pattern is called a minimal zero sequential generator pattern (mZRSGr) if and only if all of its proper subsequences are non-zero sequential patterns.

A minimal zero sequential generator pattern is the upper border limit of a valid rare sequential pattern. Note that all the super sequential patterns that can be generated from an mZRSGr do not appear in the database.

For example, the non-existent sequence pattern 〈{a}, {f}, {d}〉, which is shown in a small dotted rectangle with a number zero on top in Figure 3.1, is a minimal zero sequential generator and all super sequences of this pattern do not exist.

Definition 10 (Equivalence Class): Two patterns Sa and Sb are in the same equivalence class with respect to a sequential database SDB if and only if for each sequence s ∈ SDB, we have Sa ⊑ s if and only if Sb ⊑ s. In other words, all the patterns in an equivalence class occur in the same sequences in the SDB. Therefore, they have the same support.

For example, the rare patterns 〈{a, b}〉, 〈{a, b}, {c}〉, 〈{a, b}, {e}〉, 〈{a, b}, {c}, {e}〉, and 〈{b}, {c}, {e}〉 form an equivalence class. These patterns have the same support and they belong to the same sequence (SID3) of the SDB (shown in Table 3.1). Note that generator patterns are the smallest or minimal patterns in size, while the closed patterns are the largest or maximal patterns among all of the member patterns of an equivalence class. An equivalence class may contain more than one generator and more than one closed pattern. For example, the sequential patterns 〈{d}〉, 〈{b, d}, {c}〉, 〈{a}, {d}〉, 〈{a}, {b, d}〉, 〈{d}, {c}〉, 〈{d}, {e}〉, 〈{b, d}, {e}〉, 〈{a}, {b, d}, {e}〉, 〈{a}, {b, d}, {c}〉, 〈{a}, {d}, {e}〉, 〈{d}, {e}, {c}〉, 〈{a}, {d}, {c}〉, 〈{b, d}, {e}, {c}〉, 〈{a}, {b, d}, {e}, {c}〉, 〈{a}, {d}, {e}, {c}〉 from the database given in Table 3.1 form an equivalence class shown in Figure 3.3. In this equivalence class, the sequential patterns 〈{d}〉 and 〈{b, d}〉 are the minimal rare sequential generator patterns, while the rare sequential patterns 〈{a}, {b, d}, {e}, {c}〉 and 〈{a}, {d}, {e}, {c}〉 are the closed sequential patterns.
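Definition 10 amounts to grouping patterns by the set of sequence IDs in which they occur. The sketch below illustrates this on a handful of patterns from Table 3.1; it is not the thesis implementation, and the helper names (`seq`, `contains`, `equivalence_classes`, `minimal_and_maximal`) are hypothetical.

```python
from collections import defaultdict


def seq(*events):
    """Build a sequence from strings, e.g. seq("a", "bd") is <{a}, {b, d}>."""
    return tuple(frozenset(e) for e in events)


def contains(sub, s):
    """Sequence containment test of Definition 3."""
    pos = 0
    for event in sub:
        while pos < len(s) and not event <= s[pos]:
            pos += 1
        if pos == len(s):
            return False
        pos += 1
    return True


SDB = {  # Table 3.1
    "SID1": seq("a", "bd", "e", "c"),
    "SID2": seq("a", "c", "b", "e"),
    "SID3": seq("ab", "c", "b", "e"),
    "SID4": seq("b", "ce", "e"),
    "SID5": seq("a", "bc", "e", "f", "c"),
    "SID6": seq("g"),
}


def equivalence_classes(patterns, sdb):
    """Group patterns by the exact set of sequences that contain them."""
    classes = defaultdict(list)
    for p in patterns:
        sids = frozenset(sid for sid, s in sdb.items() if contains(p, s))
        classes[sids].append(p)
    return dict(classes)


def minimal_and_maximal(members):
    """Smallest and largest members of a class, measured by size (event count)."""
    sizes = [len(p) for p in members]
    return ([p for p in members if len(p) == min(sizes)],
            [p for p in members if len(p) == max(sizes)])


patterns = [seq("d"), seq("bd"), seq("a", "d"), seq("a", "bd", "e", "c"),
            seq("a", "d", "e", "c"), seq("ab"), seq("ab", "c", "e")]
classes = equivalence_classes(patterns, SDB)
minimal, maximal = minimal_and_maximal(classes[frozenset({"SID1"})])
```

Keying each class on the set of sequence IDs, rather than on the support count alone, keeps apart patterns that happen to have equal support but occur in different sequences.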

It is worth noting that at a high level, itemset pattern mining and sequence pattern mining can be parallel. However, sequence pattern mining differs from itemset mining, especially when the candidate sequence generator pattern is generated. For example, if a candidate sequence pattern is generated from two items a and b, then in itemset mining only one candidate itemset pattern {a, b} can be generated. However, in sequence mining there are two possible candidate sequence patterns that can be generated, i.e., 〈{a}, {b}〉 and 〈{b}, {a}〉. Therefore, if the sequence of events is taken into consideration while calculating the support, the itemset pattern is different from the sequential pattern. On the other hand, if the sequence is ignored during the support calculation, the itemset pattern mining and the sequential pattern mining are the same.


Figure 3.2: The positive and the negative border of a lattice of a sequential database.

Mannila and Toivonen [105] defined the notions of positive and negative borders. According to Mannila and Toivonen, the maximal frequent patterns form the positive border of the frequent zone and the minimal rare patterns form the negative border of the infrequent zone. Figure 3.2 depicts only a partial view of a lattice structure that represents the positive and the negative borders of the sequential patterns of the sequential database in Table 3.1. The solid slanted line separates the rare patterns on the right side from the frequent patterns on the left side. The minimal rare sequential generators along the right side of the solid slanted line form the negative border and the maximal frequent sequential patterns on the left side of the solid slanted line form the positive border. The rare generators hold the important property that all of their subsequences are frequent, and as such those subsequences must have a higher support value than the rare generators. For example, the minimal rare sequential generator patterns 〈{d}〉, 〈{f}〉, 〈{g}〉, 〈{a, b}〉, 〈{c, e}〉, 〈{b, d}〉, 〈{b, c}〉, 〈{b}, {b}〉, 〈{c}, {c}〉, 〈{e}, {e}〉, which are shown in thick bordered shaded rectangles, form the negative border along the right side of the solid slanted line. On the other hand, the maximal frequent sequential patterns 〈{a}, {b}, {c}〉, 〈{a}, {e}, {c}〉, 〈{a}, {c}, {e}〉, 〈{c}, {b}, {e}〉, 〈{a}, {b}, {e}, {c}〉, which are shown in dash-and-dot rectangles, form the positive border along the left side of the solid slanted line.

The pattern in the dotted rectangle on the right side of the solid slanted line is a non-existent pattern, meaning this pattern never occurs in the SDB. As there exist many non-existent patterns, most are omitted and are shown as three dots (...) in Figure 3.1. The frequent generators, which are shown in dashed rectangles, and other frequent patterns, which are shown in solid and dash-and-dot rectangles, are on the left side of the solid slanted line. The frequent generators are the frequent patterns for which there are no subsequence patterns with the same support value. On the other hand, for each of the patterns shown in dotted rectangles, at least one of its subsequences has the same support value. The rare generators hold the important property that all of their subsequences are frequent sequences, including frequent generators. Furthermore, for both frequent and rare generators, the support of any proper subsequence of a generator must be higher than the support of the generator. This property is defined by Pasquier et al. [104] as follows:

Property 1: A sequence X is a generator if and only if the support of X is lower than the support of Y for every proper subsequence Y of X.
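Property 1 can be tested without enumerating every proper subsequence. Because support never decreases when a pattern is made smaller, it is enough to compare a pattern against the subsequences obtained by deleting a single item. The sketch below relies on that observation, which is an assumption of this illustration rather than a statement from the thesis, and all helper names are hypothetical. Candidates with zero support are not treated specially here.

```python
def seq(*events):
    """Build a sequence from strings, e.g. seq("a", "bd") is <{a}, {b, d}>."""
    return tuple(frozenset(e) for e in events)


def contains(sub, s):
    """Sequence containment test of Definition 3."""
    pos = 0
    for event in sub:
        while pos < len(s) and not event <= s[pos]:
            pos += 1
        if pos == len(s):
            return False
        pos += 1
    return True


SDB = {  # Table 3.1
    "SID1": seq("a", "bd", "e", "c"),
    "SID2": seq("a", "c", "b", "e"),
    "SID3": seq("ab", "c", "b", "e"),
    "SID4": seq("b", "ce", "e"),
    "SID5": seq("a", "bc", "e", "f", "c"),
    "SID6": seq("g"),
}


def support(pattern, sdb):
    return sum(contains(pattern, s) for s in sdb.values())


def one_item_removed(pattern):
    """All subsequences obtained by deleting exactly one item from one event."""
    result = set()
    for i, event in enumerate(pattern):
        for item in event:
            rest = event - {item}
            middle = (rest,) if rest else ()  # drop the event if it becomes empty
            result.add(pattern[:i] + middle + pattern[i + 1:])
    return result


def is_generator(pattern, sdb):
    """Property 1: support strictly lower than that of every proper subsequence."""
    sup = support(pattern, sdb)
    return all(sup < support(sub, sdb) for sub in one_item_removed(pattern))


for p in (seq("b", "c"), seq("a", "b"), seq("ab"), seq("b", "e")):
    print([sorted(e) for e in p], support(p, SDB), is_generator(p, SDB))
```

For a size-1, single-item pattern the only proper subsequence is the empty sequence, whose support equals the number of sequences in the database.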

In this chapter, we propose methods to generate all generators including both

frequent and rare generators based on Property 1 (Phase 1), then generate all

rare sequential patterns from the rare generators (Phase 2).


3.3 A New Method For Finding Rare Sequential Patterns

In this section we present a new general method for finding rare sequential patterns. This method comprises two phases. In the first phase, we generate rare sequential generator patterns, which are the seeds for generating all rare sequential patterns in the second phase. Note that the subsequence patterns of a rare generator pattern are generator patterns, but could be frequent generators or rare generators. According to Szathmary et al. [98], there are two ways to generate rare patterns, either from frequent generators or from maximal frequent patterns. The latter method requires finding all frequent patterns up to the maximal patterns, from which all rare generators can be produced. But this method, generating rare generators from maximal frequent patterns, requires greater computational time because it needs to explore all frequent patterns to generate the maximal frequent patterns. Hence, we propose to use frequent generators [106] to find all rare generators, allowing us to use only a subset of the frequent patterns. To do this, we split the set of events Events(SDB) into two sets, a set of rare size-1 sequences and a set of frequent size-1 sequences. The size-1 rare patterns are the rare generator sequential patterns while the frequent patterns of size-1 are frequent generator sequential patterns. We exclusively use these frequent generator sequential patterns to find further rare generator sequential patterns based on Property 1 by applying an apriori-like method. The method is formally described in Algorithm 3.1. The rare generators are the seeds for generating all rare sequential patterns.

In the second phase, we generate all rare sequential patterns by generating super patterns of rare patterns, starting from rare sequential generators. According to the apriori property [55], any super pattern of an infrequent pattern must be infrequent. The rare sequential patterns are extended by merging them with size-1 sequences to generate more rare sequential patterns until no more new rare patterns can be generated. In the process of making super sequential patterns, we keep the integrity of the original sequence, from which candidate super sequences are generated, by not changing the order of its itemsets or events. In other words, even when an itemset forming a size-1 sequence is added at different positions of the original sequence, the order of the itemsets of the original sequence remains intact in the generated candidate super sequences. For example, given two sequential patterns S1 = 〈{d}, {c}, {b}〉 and S2 = 〈{a, b}〉 from the sequential database SDB (shown in Table 3.1), there are different ways to combine them to form new candidate super sequential patterns at the sequence level. We can combine them in the following possible ways:

〈{a, b}, {d}, {c}, {b}〉, 〈{d}, {a, b}, {c}, {b}〉, 〈{d}, {c}, {a, b}, {b}〉, and 〈{d}, {c}, {b}, {a, b}〉.

In the above patterns, the order of the itemsets of the original pattern S1, from which all possible candidate super sequence patterns are generated, remains unchanged. Another way of generating the candidate super sequences while ensuring the integrity is to concatenate the two sequences S1 and S2 in both forward and reverse directions, such as 〈{S1}, {S2}〉 and 〈{S2}, {S1}〉:

〈{a, b}, {d}, {c}, {b}〉 and 〈{d}, {c}, {b}, {a, b}〉.

With concatenation, only the above two candidate super sequence patterns are possible at all times, irrespective of the size of the original sequence S1. Therefore, we have applied the former method, which generates all possible candidate patterns, in our algorithm.

Note that the total number of possible super patterns is equal to |S1| + 1. For example, the size of the pattern S1 is 3 because it has three events. Therefore, the total number of super patterns grown from the pattern S1 is 4. Also, the size of the super sequences is the sum of the sizes of the sequences S1 and S2. Similarly, the length of the super sequences is the sum of the lengths of the sequences S1 and S2. In other words, if the size and length of S1 are K and L respectively, and the length of the size-1 sequence S2 is L′, then the size and the length of the super patterns are (K + 1) and (L + L′) respectively. For example, the size and length of the pattern S1 are both 3, while for the pattern S2 the size is 1 and the length is 2. So, the size and the length of the super patterns become 4 and 5, respectively.

Also, in this stage all rare patterns are separated into different equivalence groups based on their frequency and the sequences in the SDB in which they occur. Rare patterns that have the same frequency and occur in the same sequences of the database SDB are put into the same equivalence class. Note that among all rare patterns in an equivalence class, we can also identify the smallest as well as the largest rare patterns, as required by different domains.

3.3.1 Generating Rare Sequential Generator Patterns

In this phase, all rare sequential generator patterns (RSG) are discovered from the sequential database SDB. First, based on the frequency test (comparing the number of sequences that contain each event against the defined support threshold value minsup), Events(SDB) is divided into rare and frequent zones. The sequences in the rare zone are size-1 rare generators. On the other hand, the frequent zone sequences are size-1 frequent generators. This is done in steps 1-15 in Algorithm 3.1 given below.

To find size-2 and above rare generators, size-(s−1) frequent generators, s ≥ 2, are combined in pairs to generate size-s super patterns; the two generators in a pair share a common prefix subsequence of size-(s−2). For example, two frequent generator patterns of size-3, 〈{c}, {b}, {e}〉 and 〈{c}, {b}, {f}〉, share a common prefix size-2 subsequence 〈{c}, {b}〉, so these two patterns produce two super sequences of size-4, namely 〈{c}, {b}, {e}, {f}〉 and 〈{c}, {b}, {f}, {e}〉. In other words, keeping the common prefix sequence unchanged, the remaining suffix events are merged in both forward and reverse directions. These super sequence patterns are tested against the maxsup value to generate frequent generators and rare generators. This is done in steps 18 to 32 in Algorithm 3.1.
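The pairwise combination described above can be sketched as a prefix join. The code is an illustration under stated assumptions: a sequence is a tuple of frozensets, the names `prefix_join` and `join_all` are hypothetical, and joining a generator with itself is allowed so that candidates such as 〈{a}, {a}〉 in Table 3.2(b) are produced. Whether self-joins are intended at every level is not stated in the text.

```python
def seq(*events):
    """Build a sequence from strings, e.g. seq("c", "b") is <{c}, {b}>."""
    return tuple(frozenset(e) for e in events)


def prefix_join(p, q):
    """Join two equal-size patterns that agree on everything but the last event.

    The shared prefix is kept unchanged and the two last events are appended
    in both orders, giving two candidates that are one event larger.
    """
    if len(p) != len(q) or p[:-1] != q[:-1]:
        return []
    prefix = p[:-1]
    return [prefix + (p[-1], q[-1]), prefix + (q[-1], p[-1])]


def join_all(generators):
    """Candidate set obtained by joining every ordered pair of generators."""
    candidates = set()
    for p in generators:
        for q in generators:
            candidates.update(prefix_join(p, q))
    return candidates


# Example from the text: two size-3 patterns sharing the prefix <{c}, {b}>.
joined = prefix_join(seq("c", "b", "e"), seq("c", "b", "f"))

# Size-1 frequent generators of Table 3.2(a) share the empty prefix.
FSG1 = [seq("a"), seq("b"), seq("c"), seq("e")]
CSG2 = join_all(FSG1)
print(len(CSG2))
```

Joining the four size-1 frequent generators yields every ordered pair of them, which matches the number of rows listed for CSG2 in Table 3.2(b).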

The above process continues until no more frequent generators can be found. In the third iteration with the database SDB, the process stops as there exist no frequent generators. So, at the end of the process, all rare generator patterns are collected. To demonstrate the process, the execution of Algorithm 3.1 with a maxsup value set to 2 on the database SDB given in Table 3.1 is illustrated in Table 3.2. At first, the algorithm finds all the size-1 sequences as candidate sequence generators CSG1 in Table 3.2(a) in a single database scan. Then the support values of these sequences are counted. Since an empty sequence is a subsequence of every sequence in the SDB, the empty sequence is frequent and its support is equal to the total number of sequences in the SDB. In other words, the empty sequence has a 100% support value, shown as FSG0 in Table 3.2(a).


Algorithm 3.1: Finding Rare Sequential Generator Patterns.
Input: A sequential database SDB, maxsup
Output: Rare Sequential Generator Patterns (RSG)

1  CSG1 ← {〈e〉 | ∀e ∈ Events(SDB)}   // CSG1 is a set of candidate sequence generators with size-1 sequences
2  FSG1 ← {}, RSG1 ← {}   // FSG1 and RSG1 are the sets of frequent sequential generators and rare sequential generators respectively
3  S.supp0 = |SDB|, ∀S ∈ CSG1
4  Count support S.supp1 of each sequence S in CSG1 by scanning the SDB
5  for S ∈ CSG1 do
6      if S.supp1 = S.supp0 then
7          remove S from CSG1
8      else
9          if S.supp1 ≥ maxsup then
10             FSG1 ← FSG1 ∪ {S}
11         else
12             RSG1 ← RSG1 ∪ {S}
13 s ← 2
14 FSGs ← {}, RSGs ← {}
15 while FSGs−1 not empty do
16     CSGs ← all possible combinations of two sequences with common prefix of size-(s−2) subsequences in FSGs−1
17     for S ∈ CSGs do
18         ms ← minimum support of the size-(s−1) subsequences of S
19         S.supps ← 0
20         for a ∈ SDB do
21             if S ⊑ a then
22                 S.supps ← S.supps + 1
23             else
24                 continue
25         if S.supps = ms then
26             remove S from CSGs
27         else
28             if S.supps ≥ maxsup then
29                 FSGs ← FSGs ∪ {S}
30             else
31                 RSGs ← RSGs ∪ {S}
32     s ← s + 1
33 return RSG = RSG1 ∪ RSG2 ∪ ... ∪ RSGs−1


Table 3.2: Execution of Algorithm 3.1.

(a) First iteration

CSG1         minsup of FSG0   SupOf CSG1   True CSG1
〈{a}〉        6                4            Yes
〈{b}〉        6                5            Yes
〈{c}〉        6                5            Yes
〈{d}〉        6                1            Yes
〈{e}〉        6                5            Yes
〈{f}〉        6                1            Yes
〈{g}〉        6                1            Yes
〈{a,b}〉      6                1            Yes
〈{b,c}〉      6                1            Yes
〈{c,e}〉      6                1            Yes
〈{b,d}〉      6                1            Yes

RSG1         SupOf RSG1       FSG1         SupOf FSG1
〈{d}〉        1                〈{a}〉        4
〈{f}〉        1                〈{b}〉        5
〈{g}〉        1                〈{c}〉        5
〈{a,b}〉      1                〈{e}〉        5
〈{b,c}〉      1
〈{c,e}〉      1
〈{b,d}〉      1

(b) Second iteration

CSG2         minsup of FSG1   SupOf CSG2   True CSG2
〈{a},{a}〉    4                0            No
〈{a},{b}〉    4                4            No
〈{b},{a}〉    4                0            No
〈{a},{c}〉    4                4            No
〈{c},{a}〉    4                0            No
〈{a},{e}〉    4                4            No
〈{e},{a}〉    4                0            No
〈{b},{b}〉    5                1            Yes
〈{b},{c}〉    5                4            Yes
〈{c},{b}〉    5                2            Yes
〈{b},{e}〉    5                5            No
〈{e},{b}〉    5                0            No
〈{c},{c}〉    5                1            Yes
〈{c},{e}〉    5                4            Yes
〈{e},{c}〉    5                2            Yes
〈{e},{e}〉    5                1            Yes

RSG2         SupOf RSG2       FSG2         SupOf FSG2
〈{b},{b}〉    1                〈{b},{c}〉    4
〈{c},{c}〉    1                〈{c},{b}〉    2
〈{e},{e}〉    1                〈{c},{e}〉    4
                              〈{e},{c}〉    2

(c) Third iteration

CSG3             minsup of FSG2   SupOf CSG3   True CSG3
〈{c},{b},{e}〉    2                2            No
〈{c},{e},{b}〉    2                0            No

RSG3         SupOf RSG3       FSG3         SupOf FSG3
〈{}〉                          〈{}〉


To find true sequence generators, based on Property 1, the candidate sequence generators are pruned if their support is equal to the minimum support of their subsequences. For size-1, all CSG1 in Table 3.2(a) turn out to be true CSG1. Then, the true CSG1 are tested against the user-defined maxsup value to find the rare sequence generators RSG1 and frequent sequence generators FSG1. For the example in Table 3.2, 7 size-1 rare generators and 4 frequent generators are found. Next, the FSG1 are combined in pairs to generate size-2 CSG2 in Table 3.2(b), keeping the common prefix subsequence of FSG1, which is an empty sequence. In CSG2 there exist some true or potential CSG2 and some that are not true CSG2 because these patterns have the same support as one of their subsequence patterns. For example, 〈{a}, {b}〉 in CSG2 has the same support as one of its subsequences, 〈{a}〉. The true CSG2 are further separated into rare sequential generators RSG2 and frequent sequential generators FSG2 in Table 3.2(b). Finally, after the third iteration in Table 3.2(c), there exist no frequent sequential generators except the empty sequence. So, no further candidate sequences can be generated and the algorithm stops. The rare generators are collected from RSGs, where s is the size of the rare generators.
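The first iteration of this walkthrough can be reproduced with a short sketch. It is not the thesis implementation. It assumes, consistently with the rows of Table 3.2(a), that the size-1 candidates are every single item plus every multi-item event that occurs in the database, and that a candidate is frequent when its support is at least the threshold of 2 (Definition 6). All function names are hypothetical.

```python
def seq(*events):
    """Build a sequence from strings, e.g. seq("a", "bd") is <{a}, {b, d}>."""
    return tuple(frozenset(e) for e in events)


def contains(sub, s):
    """Sequence containment test of Definition 3."""
    pos = 0
    for event in sub:
        while pos < len(s) and not event <= s[pos]:
            pos += 1
        if pos == len(s):
            return False
        pos += 1
    return True


SDB = {  # Table 3.1
    "SID1": seq("a", "bd", "e", "c"),
    "SID2": seq("a", "c", "b", "e"),
    "SID3": seq("ab", "c", "b", "e"),
    "SID4": seq("b", "ce", "e"),
    "SID5": seq("a", "bc", "e", "f", "c"),
    "SID6": seq("g"),
}


def support(pattern, sdb):
    return sum(contains(pattern, s) for s in sdb.values())


def size1_candidates(sdb):
    """Single items and whole events, each wrapped as a size-1 sequence."""
    candidates = set()
    for s in sdb.values():
        for event in s:
            candidates.add((event,))
            for item in event:
                candidates.add((frozenset(item),))
    return candidates


def first_iteration(sdb, threshold):
    """Split size-1 candidates into frequent and rare generators."""
    frequent, rare = {}, {}
    for cand in size1_candidates(sdb):
        sup = support(cand, sdb)
        if sup == len(sdb):
            continue  # same support as the empty sequence: not a generator
        (frequent if sup >= threshold else rare)[cand] = sup
    return frequent, rare


FSG1, RSG1 = first_iteration(SDB, threshold=2)
print(len(FSG1), "frequent and", len(RSG1), "rare size-1 generators")
```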

3.3.2 Generating All Rare Sequential Patterns

In the second phase, all rare sequential patterns are generated from the RSG patterns discovered by Algorithm 3.1, which is explained in Section 3.3.1. We propose a level-wise method to generate all rare sequential patterns. The proposed method is formally described in Algorithm 3.2. Starting from the size-1 rare generators, which are shown in thick bordered shaded rectangles in the lattice in Figure 3.1, for each size-s, the method discovers size-(s+1) rare patterns from all possible size-(s+1) patterns generated by extending each size-s rare pattern with every event in Events(SDB), as described in steps 6 to 35 of Algorithm 3.2. The variable CRSPs+1 in Algorithm 3.2 contains all potential size-(s+1) rare patterns, which are generated in steps 7 to 13. In this process the order of the events in the sequence remains unchanged, which ensures the integrity of the original sequence. For example, 〈{b, c}, {d}〉 is a size-2 rare sequential pattern which can be extended to three possible size-3 rare sequential patterns 〈{a}, {b, c}, {d}〉, 〈{b, c}, {a}, {d}〉, and 〈{b, c}, {d}, {a}〉 with an event {a} in SDB. After the merging, all patterns of size-(s+1) are generated. These patterns fall either into rare patterns, which have a frequency


Algorithm 3.2: Generating all Rare Sequential Patterns and their Equivalence Classes.
Input: a sequential database SDB, a set of rare sequential generators RSG
Output: Generating all rare patterns and their equivalence classes

1  NEP ← {}   // NEP holds all non-existent patterns in SDB
2  GRSP ← {}   // set of equivalence classes that have the same support and occur in the same sequences in SDB
3  s ← 1
4  RSPs ← {g | g ∈ RSG, |g| = 1}   // size-1 rare generators
5  ms ← max_{S∈SDB} {|S|}
6  while s < ms and RSPs ≠ empty do
7      CRSPs+1 ← {}   // candidate rare sequence patterns of size-(s+1)
8      for each S in RSPs do
9          for each e in Events(SDB) do
10             C ← all sequences generated by adding e into S at different positions
11             CRSPs+1 ← CRSPs+1 ∪ C
12     RSPs+1 ← {}
13     for each S in CRSPs+1 do
14         if there is n in NEP such that n is a subsequence of S then
15             continue
16         else
17             RSPs+1 ← RSPs+1 ∪ {S}
18     RSPs+1 ← RSPs+1 ∪ RSGs+1
19     for each S in RSPs+1 do
20         S.supp ← 0, S.sid ← {}
21         for a ∈ SDB do
22             if S ⊑ a then
23                 S.supp ← S.supp + 1
24                 S.sid ← S.sid ∪ {a.sid}   // a.sid is the id of sequence a
25             else
26                 NEP ← NEP ∪ {S}
27         sp ← S.supp, sid ← S.sid
28         if GRSPsp,sid is in GRSP then
29             GRSPsp,sid ← GRSPsp,sid ∪ {S}
30         else
31             GRSPsp,sid ← {S}
32             GRSP ← GRSP ∪ {GRSPsp,sid}
33     s ← s + 1
34     RSPs ← RSPs ∪ RSGs
35 return RSP = RSP1 ∪ RSP2 ∪ ... ∪ RSPs−1
36 return GRSP


below the support threshold value, e.g., minsup = 2 for the database in Table 3.1, or into non-existent patterns, which have frequency zero, meaning that these patterns do not exist in the database SDB. Note that these non-existent patterns are not used to generate further super patterns, as those super patterns would also be non-existent. Also, no frequent patterns are formed at this stage because these patterns grow from minimal rare generators, and any super pattern originating from a rare pattern cannot be frequent.

Figure 3.3: An equivalence class of rare sequential patterns.

In addition, at every stage, the rare patterns are grouped into different equivalence classes based on their equal support value and common sequences in the SDB. In Algorithm 3.2, for a pattern S, its support is denoted as S.supp, and the set of sequence IDs where S occurs is denoted as S.sid. For example, the rare


sequential patterns in Figure 3.3 form an equivalence class, which comprises of

16 rare sequential patterns because all of these patterns have the same support

value 1 and they belong to the same sequence SID3 of the database SDB shown

in Table 3.1. At the end of the process, all rare sequential patterns in the SDB

are distributed into di�erent equivalence classes by Algorithm 3.2. The smallest

rare sequential patterns 〈{d}〉 and 〈{b, d}〉 shown in thick bordered shaded rect-

angles at the bottom of the Figure 3.3 are the minimal rare sequential generator

patterns. The maximal rare sequential patterns shown in dashed rectangles at

the top of the Figure 3.3 are the closed sequential patterns. All other rare se-

quential patterns shown in thick bordered unshaded rectangles in Figure 3.3 are

generated from the minimal rare sequential generator patterns.
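The grouping step can be sketched in Python as follows. This is a hedged illustration rather than the thesis code: patterns are assumed to be tuples of events, each mapped to the set of sequence IDs in which it occurs, and the helper that picks the smallest and largest patterns compares lengths only, which is a simplification of the minimal-generator and closed-pattern definitions. All names and the sample data are hypothetical.

```python
from collections import defaultdict


def group_into_equivalence_classes(pattern_sids):
    """Group patterns that share the same support and the same sequence ids.

    pattern_sids -- dict mapping pattern (tuple of events) -> set of sequence ids
    Returns a dict keyed by (support, frozenset of ids).
    """
    classes = defaultdict(list)
    for pattern, sids in pattern_sids.items():
        key = (len(sids), frozenset(sids))  # support = number of sequences
        classes[key].append(pattern)
    return dict(classes)


def smallest_and_largest(patterns):
    """Return the shortest and longest patterns of one class (by length only)."""
    lengths = [len(p) for p in patterns]
    smallest = sorted(p for p in patterns if len(p) == min(lengths))
    largest = sorted(p for p in patterns if len(p) == max(lengths))
    return smallest, largest


# Hypothetical patterns with the ids of the sequences that contain them.
pattern_sids = {
    ("d",): {3},
    ("a", "d"): {3},
    ("d", "c"): {3},
    ("a", "d", "c"): {3},
    ("e",): {1},
}
classes = group_into_equivalence_classes(pattern_sids)
smallest, largest = smallest_and_largest(classes[(1, frozenset({3}))])
```

Keying each class on the pair (support, set of sequence IDs) mirrors the GRSP_{sp,sid} index used in Algorithm 3.2.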

3.4 Evaluation

In this section, we outline the experimental methodology used to evaluate our proposed algorithms, which find minimal, maximal and all rare sequential patterns in order to detect anomalies in SCADA control logs. The logs used in the experiments are off-line control logs; that is, SCADA activities were recorded in log files, which were later pre-processed for use with the rare sequential pattern mining algorithm. Firstly, we present the experimental setup, which represents a complete industrial control system running on three different control systems: the conveyor belt system, the water tank system, and the pressure control system. Secondly, we describe the datasets used throughout the experiments to evaluate our proposed methods. Finally, we explain the results validating our proposed methods.

3.4.1 SCADA System Architecture

This thesis uses three separate SCADA control systems, namely a conveyor belt control system, a pressure control system and a water tank control system, which are configured in the SCADA laboratory. All of these control systems are connected by a human machine interface (HMI), which not only monitors and controls the processes but also collects the activity logs generated by the programmable logic controllers (PLCs) in each of the systems.


The conveyor belt system consists of two loops of conveyor belts running in two different directions, left and right. As light and dark objects pass a sensor, a sorting arm changes direction, moving each object to either the left or the right conveyor belt depending on the color of the object. This process continues for a predefined time period. After the conveyor belt finishes a process cycle, the water tank control system starts its operation. A sensor device is used to monitor the water level as a percentage of the total water level capacity of the upper tank. If the water crosses the highest or lowest level of the water tank, an alarm is triggered to indicate the upper tank's over-flow or under-flow condition. This process repeats for a predefined time period. In the final stage of the process, the pressure control system pumps air into a sealed steel pipe system and increases the air pressure. At a given upper pressure threshold value, which is measured in pounds per square inch (PSI), the air pressure is released through a solenoid valve and the air pressure in the pipe drops. Once a set lower pressure level has been reached, the air compressor starts up again, building pressure in the pipe system. This process continues for a predefined period of time. Once the compressed air pipeline stage is complete, a full cycle of the process has been completed, and the system starts again with the conveyor belt control system.

3.4.2 Datasets

To evaluate our proposed algorithms, we conducted experiments on two sets of data. The first set of data (First Dataset), comprising Dataset-1, Dataset-2, and Dataset-3, represents the logs collected from the conveyor belt, the pressure control and the water tank control systems described in Section 3.4.1. A partial view of each of these datasets is shown in Table 3.3, Table 3.4 and Table 3.5 respectively. These datasets were collected from a training session held in the SCADA laboratory from 9.30 a.m. to 4.00 p.m. The three process control systems (conveyor belt, pressure control, and water tank) were switched on and started functioning smoothly on the training day.

Two highly skilled professional teams, named the blue team and the red team, participated in the training session. The aim of the blue team was to operate and monitor the SCADA control system equipment, while the aim of the red team was to conduct cyber-attacks on the control system. The system was compromised by the red team, who successfully disrupted the processes running on all three control systems in the latter half of the day. All the events (regular and attacked) were recorded in the log files. A total of 205 868, 388 877, and 228 762 lines of logs were recorded in Dataset-1, Dataset-2 and Dataset-3 respectively during the whole-day training period. The logs were recorded under 5 different attributes or features: VarName, TimeString, VarValue, Validity, and Time_ms.

Table 3.3: A partial view of a conveyor belt control system log.

VarName                              TimeString              VarValue
Conv_Read_Conv_Color_PE              16/07/2015 9:31:11 AM   0
Conv_Read_Conv_HMI_Direction         16/07/2015 9:31:11 AM   0
Conv_Read_Conv_Present_PE            16/07/2015 9:31:11 AM   0
Conv_Read_Solenoid_Left_Direction    16/07/2015 9:31:11 AM   -1
Conv_Read_Solenoid_Right_Direction   16/07/2015 9:31:11 AM   0
Conv_Run_Status                      16/07/2015 9:31:11 AM   0
HMI_Conv_Direction                   16/07/2015 9:31:11 AM   0
HMI_Conv_Master_Mode                 16/07/2015 9:31:11 AM   -1
HMI_Conv_Reset                       16/07/2015 9:31:11 AM   0

Table 3.4: A partial view of a pressure control system log.

VarName                       TimeString              VarValue
HMI_Pipe_Pump_Off_SP          16/07/2015 9:31:11 AM   40
HMI_Pipe_Pump_On_SP           16/07/2015 9:31:11 AM   5
HMI_Pipe_Solenoid_Off_SP      16/07/2015 9:31:11 AM   30
HMI_Pipe_Solenoid_On_SP       16/07/2015 9:31:11 AM   40
HMI_Pipe_Master_Mode          16/07/2015 9:31:11 AM   -1
Pipe_Pump_Run_Status          16/07/2015 9:31:11 AM   -1
Pipe_Read_Pipeline_Pressure   16/07/2015 9:31:11 AM   0.3130435
Pipe_Read_Pump_Mode           16/07/2015 9:31:11 AM   0
Pipe_Read_Pump_Run_Cmd        16/07/2015 9:31:11 AM   0
Pipe_Read_Solenoid_Mode       16/07/2015 9:31:11 AM   0

The attribute VarName holds the name of the event that occurs in the control system, TimeString records the date and time at which the event occurs, VarValue holds the value of the event named in VarName, Validity is used to check the floating point of VarValue, and Time_ms holds the TimeString value in milliseconds. Of these 5 attributes, we used only 3 in our experiments, namely VarName, TimeString, and VarValue, since these attributes hold the information required for finding anomalies.


Table 3.5: A partial view of a water tank control system log.

VarName                    TimeString              VarValue
HMI_Tank_Master_Mode       16/07/2015 9:31:11 AM   -1
Tank_Level                 16/07/2015 9:31:11 AM   52.68681
Tank_Off_SP_Int            16/07/2015 9:31:11 AM   80
Tank_On_SP_Int             16/07/2015 9:31:11 AM   50
Tank_Read_Pump_In_Auto     16/07/2015 9:31:11 AM   0
Tank_Read_Pump_In_Manual   16/07/2015 9:31:11 AM   0
Tank_Read_Pump_Running     16/07/2015 9:31:11 AM   0
Tank_Read_Tank_Level       16/07/2015 9:31:11 AM   52.6868
Tank_Stopped               16/07/2015 9:31:11 AM   -1
Tank_Usage_Level           16/07/2015 9:31:11 AM   47.3132

During the training session, the red team successfully conducted 5 different types of attacks across the three control systems. The red team changed the diverter gate direction in the conveyor belt system between 2.49 p.m. and 3.09 p.m. It attacked the water tank system by changing the tank's mode of operation from automatic to manual mode between 3.14 p.m. and 3.24 p.m. In addition, the team stopped the water tank while it was running at 3.24 p.m. Finally, in the pressure control system, an attack changed the lower pressure threshold value and the upper pressure threshold value between 3.33 p.m. and 3.35 p.m. on the training day.

Table 3.6: A partial view of a conveyor belt control system log from the second dataset.

VarName                              TimeString              VarValue
Conv_Read_Conv_Color_PE              16/06/2017 6:55:08 PM   0
Conv_Read_Conv_HMI_Direction         16/06/2017 6:55:08 PM   0
Conv_Read_Conv_Present_PE            16/06/2017 6:55:08 PM   0
Conv_Read_Solenoid_Left_Direction    16/06/2017 6:55:08 PM   -1
Conv_Read_Solenoid_Right_Direction   16/06/2017 6:55:08 PM   0
Conv_Run_Status                      16/06/2017 6:55:08 PM   0
HMI_Conv_Direction                   16/06/2017 6:55:08 PM   0
HMI_Conv_Master_Mode                 16/06/2017 6:55:08 PM   -1
HMI_Conv_Reset                       16/06/2017 6:55:08 PM   0


The second set of data (Second Dataset) comprises Dataset-4, Dataset-5, and Dataset-6, representing the conveyor belt, pressure control and water tank process control logs. These datasets were collected in a different controlled experimental setup; in other words, we reconfigured the same physical equipment using a different process control setup. Attacks were conducted to disrupt the normal process activities of the control system network. An attack PC was connected to the control system network, and a Python script was run from the attack PC to carry out these attacks.

Table 3.7: A partial view of a pressure control system log from the second dataset.

VarName                       TimeString              VarValue
HMI_Pipe_Pump_Off_SP          16/06/2017 6:55:08 PM   40
HMI_Pipe_Pump_On_SP           16/06/2017 6:55:08 PM   5
HMI_Pipe_Solenoid_Off_SP      16/06/2017 6:55:08 PM   30
HMI_Pipe_Solenoid_On_SP       16/06/2017 6:55:08 PM   40
HMI_Pipe_Master_Mode          16/06/2017 6:55:08 PM   -1
Pipe_Pump_Run_Status          16/06/2017 6:55:08 PM   -1
Pipe_Read_Pipeline_Pressure   16/06/2017 6:55:08 PM   17.63478
Pipe_Read_Pump_Mode           16/06/2017 6:55:08 PM   0
Pipe_Read_Pump_Run_Cmd        16/06/2017 6:55:08 PM   -1
Pipe_Read_Solenoid_Mode       16/06/2017 6:55:08 PM   -1

All of the normal activities as well as the attack activities were recorded in the logs. Later, we identified the attack logs and labeled them using another Python script to generate labeled attack data for validating the rare patterns. The rare patterns generated from the Second Dataset can therefore be validated against the labeled attack dataset, in contrast to the First Dataset, where validation relied on knowledge from domain experts. We ran the process control system for a period of 8 hours to generate the Second Dataset.

A total of 109 053, 474 368, and 235 990 lines of logs were recorded in Dataset-4, Dataset-5, and Dataset-6 respectively. A partial view of these datasets is shown in Table 3.6, Table 3.7, and Table 3.8 respectively. These logs, like the First Dataset logs, were recorded under 5 different attributes or features named VarName, TimeString, VarValue, Validity, and Time_ms. As with the First Dataset, we used only the 3 attributes VarName, TimeString, and VarValue from the Second Dataset in our experiments, since these attributes hold the information required for finding anomalies.

We carried out attacks on all three control systems, disrupting their regular processes. The attacks conducted on the process control system to generate the Second Dataset are as follows. Firstly, three types of attacks were conducted on the conveyor belt control system: changing the direction of the diverter gate, stopping the conveyor belt at an unscheduled time, and starting the conveyor belt after an unscheduled stoppage. A flooding attack was also conducted for a short time period, which stopped and started the conveyor belt multiple times. We kept a record of the control process logs in log files. Later, we labeled the attacked logs so that the discovered rare suspicious patterns could be compared against them to detect attack patterns.

Table 3.8: A partial view of a water tank control system log from the second dataset.

VarName                    TimeString              VarValue
HMI_Tank_Master_Mode       16/06/2017 6:55:08 PM   -1
Tank_Level                 16/06/2017 6:55:08 PM   61.77073
Tank_Off_SP_Int            16/06/2017 6:55:08 PM   80
Tank_On_SP_Int             16/06/2017 6:55:08 PM   50
Tank_Read_Pump_In_Auto     16/06/2017 6:55:08 PM   0
Tank_Read_Pump_In_Manual   16/06/2017 6:55:08 PM   0
Tank_Read_Pump_Running     16/06/2017 6:55:08 PM   0
Tank_Read_Tank_Level       16/06/2017 6:55:08 PM   61.77073
Tank_Stopped               16/06/2017 6:55:08 PM   -1
Tank_Usage_Level           16/06/2017 6:55:08 PM   38.22927

Secondly, in the pressure control system, four types of attacks were conducted. The first type of attack changed the upper threshold value of the pressure control system from its predefined set value. The second type of attack changed the lower threshold value from its predefined set value. The third type of attack stopped and then started the pressure control system at unscheduled times. The fourth type consisted of flooding attacks, conducted by activating and deactivating the pressure control system multiple times in quick succession.

Finally, in the water tank control system, we conducted two types of attacks. The first type of attack changed the mode of operation of the water tank from automatic to manual mode and later from manual back to automatic mode. The second type of attack caused an unscheduled stop and start of the water tank pump. Both types of attacks were conducted as flooding attacks, by switching the water tank between automatic and manual mode, as well as activating and deactivating the water tank, several times.


In the preprocessing step, we merged the feature or variable name VarName with its corresponding value held by the feature VarValue. Together, the VarName and the VarValue represent an itemset or event of the control process. For example, the feature Conv_Read_Solenoid_Left_Direction and its corresponding value -1 are merged as {Conv_Read_Solenoid_Left_Direction_-1}, representing an event of the conveyor belt control system from Dataset-1 in the First Dataset.

Table 3.9: A sample of the conveyor belt SDB generated from Dataset-1 in the First Dataset.

SID1  〈{Conv_Read_Conv_Color_PE_0}, {Conv_Read_Conv_HMI_Direction_0}, {Conv_Read_Conv_Present_PE_0}, {Conv_Read_Solenoid_Left_Direction_-1}, {Conv_Read_Solenoid_Right_Direction_0}, {Conv_Run_Status_0}, {HMI_Conv_Direction_0}, {HMI_Conv_Master_Mode_-1}, {HMI_Conv_Reset_0}〉

SID2  〈{Conv_Read_Conv_Color_PE_0}, {Conv_Read_Conv_HMI_Direction_0}, {Conv_Read_Conv_Present_PE_0}, {Conv_Read_Solenoid_Left_Direction_-1}, {Conv_Read_Solenoid_Right_Direction_0}, {Conv_Run_Status_-1}, {HMI_Conv_Direction_0}, {HMI_Conv_Master_Mode_-1}, {HMI_Conv_Reset_0}〉

The events that occur within a defined time duration form a sequence in the sequence database SDB. An example of the sequence database SDB from the conveyor belt control logs (Dataset-1) is given in Table 3.9. The conveyor belt sequence database SDB built from Dataset-1 comprises 22 875 sequences. Each sequence consists of the events that occurred within one second on the conveyor belt control system. For example, in the conveyor belt control system 9 events occur every second, so 22 875 sequences were created from 205 868 lines of logs. Once the sequence database SDB was created, we ran our proposed rare sequential pattern mining algorithm on the SDB.
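The preprocessing described above can be sketched in Python as follows. This is a minimal illustration, assuming that rows arrive in log order as (VarName, TimeString, VarValue) tuples and that the one-second TimeString value is used directly as the grouping key; the function name build_sdb and the sample rows for the second timestamp are hypothetical.

```python
from collections import OrderedDict


def build_sdb(rows):
    """Turn log rows into a list of (sid, sequence) pairs, one sequence per second.

    rows -- iterable of (var_name, time_string, var_value) tuples in log order
    """
    by_second = OrderedDict()
    for var_name, time_string, var_value in rows:
        event = "{}_{}".format(var_name, var_value)  # e.g. Conv_Run_Status_0
        by_second.setdefault(time_string, []).append(event)
    return [(sid, tuple(events))
            for sid, events in enumerate(by_second.values(), start=1)]


# The 9:31:11 rows are taken from Table 3.3; the 9:31:12 rows are made up
# for illustration only.
rows = [
    ("Conv_Read_Conv_Color_PE", "16/07/2015 9:31:11 AM", "0"),
    ("Conv_Read_Solenoid_Left_Direction", "16/07/2015 9:31:11 AM", "-1"),
    ("Conv_Run_Status", "16/07/2015 9:31:11 AM", "0"),
    ("Conv_Read_Conv_Color_PE", "16/07/2015 9:31:12 AM", "0"),
    ("Conv_Run_Status", "16/07/2015 9:31:12 AM", "-1"),
]
sdb = build_sdb(rows)
```

Each element of sdb pairs a sequence ID with the ordered tuple of merged events recorded during that second, which is the shape of the entries shown in Table 3.9.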

Similarly, we created the pressure control SDB from Dataset-2 in the First Dataset. The pressure control SDB is composed of 22 877 sequences from 388 877 lines of logs; in the pressure control system, 17 events occurred every second. A sample of the pressure control SDB is shown in Table 3.10. The water tank SDB was created from Dataset-3 in the First Dataset. It comprises 22 876 sequences from 228 762 lines of logs; in the water tank control system, 10 events occurred every second. A sample of the water tank SDB is shown in Table 3.11.

Table 3.10: A sample of the pressure control SDB generated from Dataset-2 in the First Dataset.

SID1  〈{HMI_Pipe_Pump_Off_SP_40}, {HMI_Pipe_Pump_On_SP_5}, {HMI_Pipe_Solenoid_Off_SP_30}, {HMI_Pipe_Solenoid_On_SP_40}, {HMI_Pipe_Master_Mode_-1}, {Pipe_Pump_Run_Status_-1}, {Pipe_Read_Pipeline_Pressure_0.3130435}, {Pipe_Read_Pump_Mode_0}, {Pipe_Read_Pump_Run_Cmd_0}, {Pipe_Read_Solenoid_Mode_0}, {Pipe_Read_Solenoid_Open_Cmd_-1}, {Pipe_Solenoid_Open_Status_-1}, {Pressure_Int_-1}, {Pump_Off_SP_Int_40}, {Pump_On_SP_Int_5}, {Solenoid_Off_SP_Int_30}, {Solenoid_On_SP_Int_40}〉

SID2  〈{HMI_Pipe_Pump_Off_SP_40}, {HMI_Pipe_Pump_On_SP_5}, {HMI_Pipe_Solenoid_Off_SP_30}, {HMI_Pipe_Solenoid_On_SP_40}, {HMI_Pipe_Master_Mode_-1}, {Pipe_Pump_Run_Status_-1}, {Pipe_Read_Pipeline_Pressure_0.326087}, {Pipe_Read_Pump_Mode_0}, {Pipe_Read_Pump_Run_Cmd_0}, {Pipe_Read_Solenoid_Mode_0}, {Pipe_Read_Solenoid_Open_Cmd_-1}, {Pipe_Solenoid_Open_Status_-1}, {Pressure_Int_-1}, {Pump_Off_SP_Int_40}, {Pump_On_SP_Int_5}, {Solenoid_Off_SP_Int_30}, {Solenoid_On_SP_Int_40}〉

Moreover, we created three more sequence databases, namely the conveyor belt SDB, the pressure control SDB, and the water tank SDB, from Dataset-4, Dataset-5, and Dataset-6 respectively in the Second Dataset. The conveyor belt SDB comprises 12 117 sequences from 109 053 lines of logs, the pressure control SDB comprises 27 904 sequences from 474 368 lines of logs, and the water tank SDB comprises 23 599 sequences from 235 990 lines of logs. All of the sequence databases generated from both the First Dataset and the Second Dataset are used as inputs to our proposed algorithms to generate rare sequential patterns.
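As a quick arithmetic check, the Second Dataset sequence counts are what one would expect if each sequence holds one second of events. The short Python sketch below assumes that the events-per-second figures stated for the First Dataset systems (9, 17 and 10) also apply to the reconfigured setup; that carry-over is an assumption and is not stated in the text.

```python
# (lines of logs, events per second, sequences reported) for the Second Dataset.
# The events-per-second figures are those given for the First Dataset systems;
# applying them here is an assumption.
second_dataset = {
    "conveyor belt": (109053, 9, 12117),
    "pressure control": (474368, 17, 27904),
    "water tank": (235990, 10, 23599),
}

# Number of one-second sequences implied by the line counts.
implied = {name: lines // per_second
           for name, (lines, per_second, _) in second_dataset.items()}

# True when every line count is an exact multiple of the per-second rate
# and matches the reported number of sequences.
exact = all(lines == per_second * reported
            for lines, per_second, reported in second_dataset.values())
```

Under that assumption, each reported sequence count equals the line count divided by the number of events logged per second.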

3.4.3 Experimental Methodology

We carried out experiments applying our two proposed algorithms. Algorithm 3.1, 'Finding Rare Sequential Generator Patterns', explained in Section 3.1, generates all rare sequential generator patterns from a sequence database, such as the conveyor belt SDB shown in Table 3.9. After generating the minimal rare generators, we apply Algorithm 3.2, 'Generating all Rare Sequential Patterns in the sequence database and their Equivalence Classes', explained in Section 3.2, to find all rare sequential patterns. These rare patterns are grown from the minimal rare generators. In addition to generating all rare sequential patterns, we also group them into different equivalence classes. As a result, we can find the smallest or minimal rare pattern, the largest or maximal rare pattern, and all rare patterns in between the minimal and maximal rare patterns in an equivalence class.

Table 3.11: A sample of the water tank SDB generated from Dataset-3 in the First Dataset.

SID1  〈{HMI_Tank_Master_Mode_-1}, {Tank_Level_52.68681}, {Tank_Off_SP_Int_80}, {Tank_On_SP_Int_50}, {Tank_Read_Pump_In_Auto_0}, {Tank_Read_Pump_In_Manual_0}, {Tank_Read_Pump_Running_0}, {Tank_Read_Tank_Level_52.6868}, {Tank_Stopped_-1}, {Tank_Usage_Level_47.3132}〉

SID2  〈{HMI_Tank_Master_Mode_-1}, {Tank_Level_52.764}, {Tank_Off_SP_Int_80}, {Tank_On_SP_Int_50}, {Tank_Read_Pump_In_Auto_0}, {Tank_Read_Pump_In_Manual_0}, {Tank_Read_Pump_Running_0}, {Tank_Read_Tank_Level_52.764}, {Tank_Stopped_-1}, {Tank_Usage_Level_47.236}〉

These algorithms were implemented in Python 2.7. The experiments were run on a Dell OptiPlex 9020 with an Intel Core i7-4770 3.4 GHz processor, 16 GB of RAM and a 256 GB HDD, running Windows 7 Enterprise. The computation time and memory consumption of the proposed algorithms can be improved with an implementation in Java or C++. We explored NumPy for library functions that find a subsequence, either consecutive or non-consecutive, in a sequence database, but could not find any built-in function for this purpose.
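In the absence of a library routine, the two kinds of subsequence test can be written in a few lines of plain Python. The sketch below is illustrative only and is not the thesis code; it assumes that sequences and patterns are tuples of hashable events, and both function names are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """Non-consecutive test: events of `pattern` appear in order, gaps allowed."""
    it = iter(sequence)
    # `event in it` advances the iterator, so later events must come later.
    return all(event in it for event in pattern)


def is_consecutive_subsequence(pattern, sequence):
    """Consecutive test: `pattern` appears as one contiguous run in `sequence`."""
    n = len(pattern)
    return any(tuple(sequence[i:i + n]) == tuple(pattern)
               for i in range(len(sequence) - n + 1))


sequence = ("a", "b", "c", "d")
gapped = is_subsequence(("a", "c"), sequence)
reordered = is_subsequence(("c", "a"), sequence)
contiguous = is_consecutive_subsequence(("b", "c"), sequence)
not_contiguous = is_consecutive_subsequence(("a", "c"), sequence)
```

The non-consecutive test makes a single pass over the sequence, whereas the consecutive test compares a sliding window at every start position.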

Anomalies could be rare or frequent in a system. Our proposed method focuses on systems in which the rarity of events is considered abnormal or irregular behavior and in which events are recorded sequentially. In other words, our proposed method is applicable to detecting anomalies that are rare in a system. An example of a rare anomalous pattern is a stealthy attack, which is conducted on a system over a long period of time, such as slow port scanning. If port scanning is conducted quickly, in a short time, the system can easily flag the activity as suspicious. However, if the port scan is conducted slowly over a long period of time, it is difficult to detect as suspicious anomalous behavior, because the stealthy activity appears as rare events on the system.

It was assumed that anomalies happen rarely in a system. So, to find rare occurrences of activities, the support threshold minsup = 2 was set throughout the experiments involving all three datasets. Moreover, it has been observed that in the domain of SCADA control systems there exists a limited number of processes, and the events or actions forming a process are repetitive. Therefore, in an attack-free or malfunction-free system, a defined number of itemsets or events that accomplish a process are considered regular or acceptable behavior. For example, {Conv_Read_Conv_Color_PE_0}, {Conv_Read_Solenoid_Left_Direction_-1}, and {Conv_Read_Solenoid_Right_Direction_-1} are three individual itemsets or events from the sequences of the conveyor belt SDB shown in Table 3.9. As these events are performed repetitively in the conveyor belt SDB, they become frequent events.

On the other hand, any change to these process events, or any change that deviates from the predefined order of the events, can make the events rare. For example, if the value of the feature Conv_Read_Solenoid_Left_Direction is changed from −1 to 0, that is, the event {Conv_Read_Solenoid_Left_Direction_-1} is changed to {Conv_Read_Solenoid_Left_Direction_0}, then the predefined outcome of the process is disrupted. In other words, the object moving on the belt is sorted in the wrong direction. Since anomalies rarely occur in a system, a rare action that is an irregular or unacceptable behavior can be considered a rare anomalous pattern. Hence, it can be treated as a suspicious event that deserves further in-depth analysis.

Currently, in the cyber security industry, existing intrusion detection systems generate many suspicious alarms for the experts to analyse. However, most of the alarms turn out to be false and only a few of them are found to be true. Since our proposed method produces fewer rare patterns, it follows that there will be fewer suspicious alarms for the experts to analyse for anomalies and attack patterns.


3.4.4 Results

The experimental results are evaluated in two phases. In the first phase, the results obtained from the First Dataset are evaluated using domain expert knowledge. This is because the First Dataset was created from a training session in the SCADA laboratory in which two highly skilled professional teams, the blue team and the red team, participated. The rare suspicious patterns generated by the rare sequential pattern mining algorithm that occurred during the time period when the system was compromised by the red team were identified as attack patterns by an expert. The identified rare patterns were later checked against the stored logs and verified. In the second phase, the results are validated with labelled attack datasets. The labelled attack datasets were created while the Second Dataset was collected in a controlled experimental setup. Attacks were conducted to disrupt the normal process activities of the control system network, and all of the normal activities as well as the attacks were recorded in the logs. Later, we identified the attacked logs and labelled them as anomalous for validating the rare patterns. The rare patterns generated by the rare sequential pattern mining algorithm from the Second Dataset were validated against the labelled attack patterns.

Firstly, we show the results obtained from the First Dataset, comprising Dataset-1, Dataset-2, and Dataset-3. After that, we show the results from the Second Dataset, consisting of Dataset-4, Dataset-5, and Dataset-6. In Dataset-1, the conveyor belt control system dataset, we found 6 rare sequential patterns, grouped into 5 equivalence classes, out of 205 868 lines of logs. An example is given in Table 3.12, where two rare sequential patterns, SID1 and SID2, have been identified as suspicious patterns.

Table 3.12: A sample of the rare sequential patterns from conveyor belt SDB in Dataset-1.

SID1  〈{HMI_Conv_Reset_-1}〉

SID2  〈{Conv_Read_Conv_Color_PE_0}, {Conv_Read_Conv_HMI_Direction_-1}, {Conv_Read_Conv_Present_PE_0}, {Conv_Read_Solenoid_Left_Direction_0}, {Conv_Read_Solenoid_Right_Direction_-1}, {Conv_Run_Status_0}, {HMI_Conv_Direction_-1}, {HMI_Conv_Master_Mode_-1}, {HMI_Conv_Reset_-1}〉

These two patterns (shown in Table 3.12) are grouped in one equivalence class since they share the same support and belong to the same sequences in the conveyor belt SDB. The first pattern, SID1, comprises only a single event, {HMI_Conv_Reset_-1}, which indicates that the conveyor belt control system has been reset or restarted from the HMI terminal. The second pattern, SID2, comprises 9 events. Note that, within this equivalence class, SID1 is the minimal or smallest rare sequential pattern while SID2 is the maximal or largest rare sequential pattern.

Since it was assumed that rare patterns could be suspicious, we consulted the domain experts who were monitoring and tracking the SCADA control process activities during the training day. Analyzing the patterns in Table 3.12, they identified 3 individual events in SID2 as anomalous, because the values of these events, {HMI_Conv_Reset_-1}, {Conv_Read_Conv_HMI_Direction_-1} and {HMI_Conv_Direction_-1}, had been changed from the expected value 0 to −1. Together, these changes caused the conveyor belt to sort the objects (dark and light) on the belt in the wrong direction, whereas the objects were set to move in a predefined direction, which was a requirement of the SCADA process. Therefore, in this scenario, it is evident that the minimal rare pattern SID1 alone could not determine that the rare pattern was anomalous; rather, it is the maximal pattern SID2, with its other relevant events such as {Conv_Read_Conv_HMI_Direction_-1} and {HMI_Conv_Direction_-1}, that aids the minimal rare pattern 〈{HMI_Conv_Reset_-1}〉 in identifying the anomalous pattern. Moreover, since this anomalous sequence occurred during the time period when the system was compromised, the experts confirmed that it represents a cyber-attack sequence.

In Dataset-2, the pressure control dataset, 57 rare sequential patterns, which form 38 equivalence classes, were found out of 388 877 lines or rows of logs. An example of these equivalence classes is shown in Table 3.13, where three rare patterns form an equivalence class. The rare patterns SID1 and SID2 in Table 3.13 are each composed of only a single event, {Pressure_Int_2} and {Pipe_Read_Pipeline_Pressure_2} respectively. On the other hand, the rare pattern SID3 comprises 17 events. From these three rare sequential patterns in Table 3.13, the experts identified 3 events, {Pipe_Read_Pipeline_Pressure_2}, {Pressure_Int_2}, and {Solenoid_On_SP_Int_55}, as anomalous in the SID3 rare sequential pattern, because the solenoid pressure value had been changed to 55 PSI from the HMI terminal. The changed value is above the preset maximum pressure threshold value of 40 PSI of the pressure control system. Here also, it is the maximal rare pattern rather than the minimal rare pattern that helps to identify the anomalous pattern. This rare pattern also occurred during the time period when the system was compromised. Expert knowledge confirms that this rare anomalous sequence is a cyber-attack on the system.

Table 3.13: A sample of rare sequential patterns from pressure control SDB in Dataset-2.

SID1  〈{Pressure_Int_2}〉

SID2  〈{Pipe_Read_Pipeline_Pressure_2}〉

SID3  〈{HMI_Pipe_Pump_Off_SP_40}, {HMI_Pipe_Pump_On_SP_5}, {HMI_Pipe_Solenoid_Off_SP_55}, {HMI_Pipe_Solenoid_On_SP_55}, {HMI_Pipe_Master_Mode_-1}, {Pipe_Pump_Run_Status_-1}, {Pipe_Read_Pipeline_Pressure_2}, {Pipe_Read_Pump_Mode_0}, {Pipe_Read_Pump_Run_Cmd_0}, {Pipe_Read_Solenoid_Mode_0}, {Pipe_Read_Solenoid_Open_Cmd_-1}, {Pipe_Solenoid_Open_Status_-1}, {Pressure_Int_2}, {Pump_Off_SP_Int_40}, {Pump_On_SP_Int_5}, {Solenoid_Off_SP_Int_55}, {Solenoid_On_SP_Int_55}〉

Finally, in Dataset-3, the water tank control system dataset, 34 rare sequential patterns forming 34 equivalence classes were found out of 228 762 lines or rows of logs. As an example, a rare sequential pattern comprising 10 events is shown in Table 3.14. This rare pattern forms an equivalence class by itself.

Table 3.14: A sample of rare sequential patterns from water tank SDB in Dataset-3.

SID1  〈{HMI_Tank_Master_Mode_-1}, {Tank_Level_53}, {Tank_Off_SP_Int_80}, {Tank_On_SP_Int_50}, {Tank_Read_Pump_In_Auto_0}, {Tank_Read_Pump_In_Manual_-1}, {Tank_Read_Pump_Running_0}, {Tank_Read_Tank_Level_53}, {Tank_Stopped_0}, {Tank_Usage_Level_47}〉

When consulted, the domain experts identified the event Tank_Read_Pump_In_Manual_-1 as an anomalous event in the rare sequential pattern in Table 3.14, because the value of this event had been changed from 0 to −1, whereas the water tank pump was set to run in automatic mode, which was a requirement for this experiment. This rare pattern occurred during the time when the system was compromised. This rare anomalous sequence was identified by the expert as a cyber-attack on the system.

Now, we describe the results found from the Second Dataset, comprising Dataset-4, Dataset-5, and Dataset-6. In the conveyor belt SDB generated from Dataset-4, we found a total of 23 rare sequential patterns. These rare patterns form 22 equivalence classes. An example of a rare pattern that forms an equivalence class is shown in Table 3.15.

Table 3.15: A sample of rare sequential patterns from conveyor belt SDB in Dataset-4.

SID1  〈{Conv_Read_Conv_Color_PE_0}, {Conv_Read_Conv_HMI_Direction_-1}, {Conv_Read_Conv_Present_PE_0}, {Conv_Read_Solenoid_Left_Direction_0}, {Conv_Read_Solenoid_Right_Direction_-1}, {Conv_Run_Status_0}, {HMI_Conv_Direction_-1}, {HMI_Conv_Master_Mode_-1}, {HMI_Conv_Reset_-1}〉

The rare pattern in Table 3.15 is an attack pattern because the value of the event HMI_Conv_Direction_-1 was changed from 0 to −1. This was done while the attack was being conducted on the conveyor belt control system, and the incident was recorded in the labeled dataset as an attack. We compared all the rare patterns found in Dataset-4 with the labeled attack dataset to find which rare patterns are attack patterns. Of the 23 discovered rare patterns, 4 rare suspicious patterns were found to be attack patterns.
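The text does not specify how the comparison with the labeled attack dataset was carried out. One plausible reading, shown here purely as a hedged sketch, is that a rare pattern counts as an attack pattern when it contains at least one event that was labeled as an attack. The function name split_by_labels, the matching rule and the sample inputs are all assumptions made for illustration.

```python
def split_by_labels(rare_patterns, attack_events):
    """Separate rare patterns that contain at least one labelled attack event.

    rare_patterns -- list of patterns, each a tuple of merged event strings
    attack_events -- set of event strings labelled as attacks (assumed format)
    """
    attacks, unconfirmed = [], []
    for pattern in rare_patterns:
        if any(event in attack_events for event in pattern):
            attacks.append(pattern)
        else:
            unconfirmed.append(pattern)
    return attacks, unconfirmed


# Hypothetical inputs; the event names follow the naming used in Table 3.15.
attack_events = {"HMI_Conv_Direction_-1"}
rare_patterns = [
    ("Conv_Run_Status_0", "HMI_Conv_Direction_-1", "HMI_Conv_Reset_-1"),
    ("Conv_Read_Conv_Present_PE_-1",),
]
attacks, unconfirmed = split_by_labels(rare_patterns, attack_events)
```

A stricter rule, such as also requiring the timestamps of the supporting sequences to fall inside a labeled attack window, would reduce accidental matches but needs the labels to carry time information.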

Table 3.16: A sample of rare sequential patterns from pressure control SDB in Dataset-5.

SID1  〈{HMI_Pipe_Pump_Off_SP_40}, {HMI_Pipe_Pump_On_SP_5}, {HMI_Pipe_Solenoid_Off_SP_25}, {HMI_Pipe_Solenoid_On_SP_40}, {HMI_Pipe_Master_Mode_-1}, {Pipe_Pump_Run_Status_-1}, {Pipe_Read_Pipeline_Pressure_0}, {Pipe_Read_Pump_Mode_0}, {Pipe_Read_Pump_Run_Cmd_0}, {Pipe_Read_Solenoid_Mode_0}, {Pipe_Read_Solenoid_Open_Cmd_-1}, {Pipe_Solenoid_Open_Status_-1}, {Pressure_Int_1}, {Pump_Off_SP_Int_40}, {Pump_On_SP_Int_5}, {Solenoid_Off_SP_Int_25}, {Solenoid_On_SP_Int_40}〉

In the pressure control SDB from Dataset-5, our algorithm discovered 218 rare patterns that form 198 equivalence classes. After comparing these rare patterns with the labeled attack dataset, we found 2 rare patterns to be attack patterns. An example of a rare attack pattern is given in Table 3.16. An attack was conducted on the pressure control system by changing its lower threshold value: the value of the event Solenoid_Off_SP_Int_25 was changed from 20 to 25, as shown in Table 3.16.

Finally, in the water tank SDB generated from Dataset-6, we found 204 rare patterns grouped in 194 equivalence classes. We could not find any attack patterns among these rare patterns. This is due to the nature of the attacks: all attacks on the water tank control system were conducted as flooding attacks. As a result, the attack events became frequent events, and hence our method could not identify them as rare events.

3.5 Discussion and Analysis

In this section we discuss and analyze our proposed methods. In Section 3.5.1 we show the effectiveness of the equivalence classes, and in Section 3.5.2 we discuss the complexity of the two algorithms, followed by a discussion of their efficiency.

3.5.1 Effectiveness of Equivalence Class

Our proposed general methods can generate all rare sequential patterns from a sequence database. The ordered sequence of actions is crucial in some domains, especially a SCADA process control system, where events are sequential, regular and definitive. In such environments, any change in the ordered sequence brings about changes to the process output or end result, as explained with an example in Section 3.1. In the first experiment with the conveyor belt dataset, our method discovered 6 rare sequential patterns. These patterns are separated into different groups, or equivalence classes, where all the patterns in one group have equal support value and appear in the same sequences in the database.
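As an illustration only, the grouping described above can be sketched in Python. This is not Algorithm 3.1 or 3.2: the helper names (`seq`, `contains`, `equivalence_classes`), the toy database and the list of rare patterns are all hypothetical, and the size of a pattern is taken to be its number of itemsets. Patterns that occur in exactly the same set of sequences necessarily have equal support, so grouping by that set yields the equivalence classes, from which the minimal and maximal patterns can be read off.

```python
from collections import defaultdict


def seq(*itemsets):
    """Build a sequence as a tuple of frozensets (single-character events only)."""
    return tuple(frozenset(s) for s in itemsets)


def contains(sequence, pattern):
    """Return True if `pattern` is an order-preserving subsequence of `sequence`."""
    pos = 0
    for itemset in sequence:
        # Greedy matching: consume the next pattern itemset as early as possible.
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def equivalence_classes(sdb, patterns):
    """Group patterns by the exact set of sequence IDs in which they occur."""
    classes = defaultdict(list)
    for pattern in patterns:
        sids = frozenset(sid for sid, s in sdb.items() if contains(s, pattern))
        classes[sids].append(pattern)
    return dict(classes)


# Hypothetical toy database and rare patterns (not taken from the thesis datasets).
sdb = {
    "SID1": seq("a", "b", "c"),
    "SID2": seq("a", "c"),
    "SID3": seq("b", "c"),
    "SID4": seq("a", "b", "d"),
}
rare_patterns = [seq("d"), seq("a", "d"), seq("a", "b", "d"), seq("a", "b", "c")]

classes = equivalence_classes(sdb, rare_patterns)
for sids, group in classes.items():
    minimal = min(group, key=len)  # smallest pattern in the class
    maximal = max(group, key=len)  # largest pattern in the class
    print(sorted(sids), "support =", len(sids), "minimal =", minimal, "maximal =", maximal)
```

In this toy example the three patterns that occur only in SID4 form one class, with 〈{d}〉 as the minimal and 〈{a}, {b}, {d}〉 as the maximal pattern.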

In each group the minimal rare pattern indicates the initial or first point of attack. The maximal rare pattern, in turn, sets the context that enables the operator to decide the impact of the attack on the SCADA process, such as a malfunction or breakdown of a SCADA system or a cyber-attack on the system. An example is given in Table 3.12, where two rare sequences SID1 and SID2 form an equivalence class. The domain experts identified 3 events in the second pattern SID2 as problematic. The values of all 3 events were changed from 0 to −1, which caused the objects to be sorted in opposite directions, that is, the light objects into the left loop and the dark objects into the right loop. However, as a requirement of the experiment, the light object was supposed to move along the right loop and the dark object along the left loop. It is worth noting that the minimal rare pattern SID1 in Table 3.12 indicates that the conveyor belt was reset, but it does not provide the context from which the domain expert can identify the pattern as anomalous. However, the maximal rare sequential pattern in SID2, which includes the minimal pattern and other relevant events, helps the domain expert to identify this pattern as anomalous as well as intrusive.

Similarly, in the pressure control system and the water tank system datasets, we also discovered some rare patterns, which are shown in Table 3.13 and Table 3.14 respectively. In the pressure system, the maximum solenoid pressure was set to 40 PSI. In the rare pattern the value was found to have been changed to 55 PSI, which is above the maximum threshold value. Hence, the experts identified this unexpected rare pattern as an intrusive pattern. Also, in the water tank control system, the system was required to run in automatic mode. It was later changed from automatic mode to manual mode by the attacking team, and this sequence is anomalous as well as an intrusion.

It has been observed that by checking the integrity constraints of the process control system, we can identify the anomalous as well as attack patterns. However, we argue that this resemblance between the results is due to the kind of attack conducted on the control system processes. The nature of the SCADA control process is that a change of a value from its predefined set value is considered an attack on the control system, which can cause disruption to the process. This means that changing an event's value or threshold value can alter the process outcome and make it a rare event. For example, changing the value of the conveyor belt event {HMI_Conv_Direction} from 0 to −1 can change the direction of the diverting gate of the conveyor belt from left to right. As a result, the object on the conveyor belt will move in the wrong direction. This change is therefore considered irregular behaviour of the system, made purposely to disrupt the conveyor belt's normal activity, and hence it is an attack on the conveyor belt control system. Similarly, the change of the upper threshold value of the pressure control


event {Solenoid_On_SP_Int} from 40 PSI to 50 PSI makes it a rare anomalous event that can hamper the outcome of the process. This change of the predefined set pressure value is an attack on the pressure control system, as it violates the normal process activities. In other application domains, if the process outcome can be changed by an attack that alters the execution order of the events, our proposed rare sequential pattern mining algorithm can also detect the rare sequential anomalous as well as attack patterns. Therefore, we argue that it is not only the integrity constraint but also the rare sequential pattern that can detect the rare anomalies and attacks in a system. In the SCADA domain datasets, we found that not all rare patterns are needed; investigating the maximal rare pattern in each group was sufficient to find anomalies as well as intrusions. However, in other application domains, the minimal rare patterns or all rare patterns in a group could be helpful for detecting anomalies in a system. Therefore, distributing all rare patterns into different groups is found to be effective.

3.5.2 Computational Complexity

Let N be the number of sequences in the sequence database SDB, M be the size of the longest sequence in SDB, and L be the maximum number of sequences in CSG_s. For Algorithm 3.1, the while-loop could iterate M times in the worst case. Inside the while-loop, the database scan at lines 20-24 goes through N sequences and the for-loop iterates L times. So, the complexity of Algorithm 3.1 would be M ∗ L ∗ N. The candidate sequence generator CSG_s contains the combinations of sequences in FSG_s with a common prefix of size (s − 1). The number of patterns in CSG_s is usually very small, that is, L << N. Therefore, by ignoring the less significant term, the complexity of the first algorithm, Algorithm 3.1, is O(M ∗ N).

In Algorithm 3.2, RSP_s is the set of rare sequential patterns of size s. The maximum size of RSP_s is at most L. Events(SDB) contains all the unique events in the database. For an application domain, events are relatively stable, so the size of Events(SDB) can be considered a constant. Let |Events(SDB)| = E. For Algorithm 3.2, the while-loop iterates M times at most. Line 10 takes |S| steps to add e into S to generate |S| patterns; in the worst case, |S| = M. Inside the while-loop, the first nested for-loop at lines 8-11 takes E ∗ L ∗ M steps. This means that the size of the resulting set CRSP of candidate rare sequence patterns is at most E ∗ L ∗ M. The second for-loop at lines 13-17 takes E ∗ L ∗ M steps in the worst case. Similarly, the third for-loop at lines 19-32 takes E ∗ L ∗ M steps as well. Taking the database scan over N sequences at lines 21-26, which lies inside the third for-loop, into consideration, the overall computational complexity can be calculated as M(E ∗ L ∗ M + E ∗ L ∗ M + E ∗ L ∗ M ∗ N) = 2E ∗ L ∗ M^2 + E ∗ L ∗ M^2 ∗ N. Given that E is a constant, the complexity is O(L ∗ M^2 + L ∗ M^2 ∗ N). Further, for a large log dataset we usually have L << N and also M << N, so it is reasonable to consider that L ∗ M < N. In this case, the complexity of the second algorithm, Algorithm 3.2, is O(M^2 ∗ N).

The efficiency of Algorithm 3.1 can also be justified by the way it generates rare patterns. Our proposed method does not generate all frequent patterns; instead, it generates only frequent generators to find all rare generator patterns. As such, it needs to explore only a subset of the frequent patterns, namely the frequent generators. Otherwise, all frequent patterns would have to be generated, which would cost large amounts of memory as well as search time. Moreover, we compared our algorithm with the FEAT (Frequent sEquence generATor) miner algorithm proposed by Gao et al. [107] to check whether our algorithm generates all the frequent sequence generators. We ran the FEAT algorithm using SPMF [108] (Sequential Pattern Mining Framework). We found that the number of frequent generators produced by the FEAT algorithm on the different databases (shown in Table 3.17) is the same as the number of frequent generators generated by our algorithm.

Table 3.17: Comparison among the databases regarding the number of frequent generators our algorithm produced and the number of frequent generators produced by FEAT algorithm.

Dataset          Database (SDB)     # Sequences in database   # Freq. gen. (our algorithm)   # Freq. gen. (FEAT algorithm)
First dataset    Conveyor belt      22875                     538                            538
First dataset    Pressure control   22876                     4572                           4572
First dataset    Water tank         22877                     3919                           3919
Second dataset   Conveyor belt      12117                     427                            427
Second dataset   Pressure control   27904                     4043                           4043
Second dataset   Water tank         23599                     2961                           2961

For example, from the conveyor belt SDB in the first dataset, our proposed algorithm generated 538 frequent generators from 22 875 sequences, and Gao et al.'s [107] FEAT algorithm produced the same number of frequent generators. Table 3.17 shows the number of frequent generators produced by our method in comparison to the FEAT algorithm. Also note that our proposed second method, Algorithm 3.2, does not generate unwanted patterns that would originate from non-existent patterns, which saves search space and time. An example of a non-existent pattern is 〈{c}, {a}〉, shown in the lattice in Figure 3.1, from which no further candidate sequence patterns can be generated.

3.6 Related Work

An anomaly or outlier is defined as "an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism" [109]. Malicious events or intrusions can be detected using anomaly detection techniques [22]. The anomaly detection method was originally proposed by Denning [23], and since then it has been used in computer security in general and for intrusion detection in particular [24]. Balducelli et al. [86] attempted to find anomalous or abnormal behavior using a case-based reasoning method. They compared a sequence of SCADA log events with previously defined normal behavior. In general, algorithms used for anomaly detection need normal operation data, also called labeled data, to build a training model. These algorithms generally consider anomalies to be patterns that have not been seen before in the normal or regular behavioural patterns of a system [99] [101]. Data mining based anomaly detection techniques can be used for both signature and behaviour based anomaly detection [29], and can further be classified as (i) supervised methods, (ii) semi-supervised methods and (iii) unsupervised methods [110].

So far, there have been few works that use data mining to find anomalies. Manganaris et al. [26] showed that the absence of a frequent event or set of events can be considered an anomaly. Clifton et al. [27] applied a data mining technique to identify the normal behaviour of a system based on the frequent occurrence of alarm events, which were later filtered out from suspicious event lists. Barbara et al. [95] built models of systems' and users' normal behaviour using association rules mined from network traffic data. Later, in the detection model, they treated any deviation from these association rules as abnormal behaviour, and thus as an anomaly of the system and users. Fan et al. [30] used a signature based ANN classifier to detect malicious sequential patterns in a sequence of machine instructions. Their method inherently lacks the ability to identify new malware for which no signature has been trained. Hadžiosmanović et al. [17] applied frequent itemset mining with varying support values to find rare events in SCADA process logs. They could only identify single rare events or items, which the stakeholders identified as potential vulnerabilities, rather than a sequence of anomalous events. Lee and Stolfo [111] pre-labeled regular system calls as normal sequences and then looked for abnormal sequences not found in the normal list. But finding all normal sequences is almost impossible. Julisch and Dacier [28] used the episode rules pattern mining technique to reduce irrelevant alarm signals using false positives from historical alarms. These methods depend on signature based rules, which inherently lack the ability to identify new attacks.

Saha et al. [97] mention that the basic strategy for rare or infrequent pattern mining is to identify all the frequent patterns from a transaction database using a user-provided minimum support threshold value and then prune these patterns from the database. The remaining patterns, which fall below the threshold value, are considered rare patterns. Szathmary et al. [98] discovered rare itemsets by identifying minimal rare itemset generators. The main idea behind their approach was to identify individual frequent items; if a combination of these frequent items is infrequent, then this combination is considered a rare itemset. Their work focused on rare itemset mining, which ignores the order of the events, but we argue that ordered rare events may help to find anomalous patterns. Shengyi et al. [112] applied a data mining approach called the common path algorithm, in which they labeled attack data based on some assumed rules. As attackers constantly update their techniques, they can bypass this signature based method, so it cannot find unknown attacks that do not match the defined rules.

Until now, there has been limited research in rare itemset mining. Among these works, Szathmary et al. [98] is a pioneer: starting from frequent generators, the authors derive rare itemset generators by merging frequent itemsets or events. For example, two individual frequent events {A} and {B}, when merged together, become {A,B}, which turns out to be a rare itemset. However, this work did not consider the order of events. Szathmary et al.'s approach therefore cannot be applied to find anomalies in a system where the order of events is significant, such as a SCADA system. In particular, SCADA systems are often used to control critical infrastructure, and detecting an anomaly is of the highest importance, since a process failure in a SCADA system may physically damage equipment and endanger human life. In addition, Szathmary et al.'s algorithm cannot maintain the integrity of the original pattern, meaning that the order of the events is lost in their method. As a result, a causal relationship among the events cannot be established. For example, the rare itemset pattern {Fan-Failure, Device-Down} does not provide information as to which event triggers the other.
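The loss of order can be made concrete with a small Python sketch. It is not Szathmary et al.'s algorithm; it only contrasts unordered (itemset) support with ordered (sequence) support on a hypothetical log. The function names and the extra event "Restart" are assumptions made for the illustration.

```python
def itemset_support(logs, items):
    """Number of logs that contain every event in `items`, in any order."""
    return sum(1 for log in logs if set(items) <= set(log))


def sequence_support(logs, pattern):
    """Number of logs that contain the events of `pattern` in the given order."""
    def occurs_in_order(log):
        it = iter(log)
        # `event in it` advances the iterator, so later events must come later.
        return all(event in it for event in pattern)
    return sum(1 for log in logs if occurs_in_order(log))


# Hypothetical logs; "Restart" is an invented event used only for illustration.
logs = [
    ["Fan-Failure", "Restart"],
    ["Device-Down", "Restart"],
    ["Fan-Failure", "Restart"],
    ["Device-Down", "Restart"],
    ["Fan-Failure", "Device-Down"],
    ["Device-Down", "Fan-Failure"],
]

# Each event on its own occurs in 4 logs, but the pair occurs together in only 2.
print(itemset_support(logs, {"Fan-Failure"}))
print(itemset_support(logs, {"Device-Down"}))
print(itemset_support(logs, {"Fan-Failure", "Device-Down"}))

# The itemset view cannot separate the two orders; the sequence view can.
print(sequence_support(logs, ("Fan-Failure", "Device-Down")))
print(sequence_support(logs, ("Device-Down", "Fan-Failure")))
```

The itemset {Fan-Failure, Device-Down} has a single support value, while the two ordered patterns are counted separately, which is the information needed to say which event preceded the other.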

Although there have been a limited number of works on rare itemset mining, they did not address the order of itemsets or events. Our research, however, focuses on sequential patterns, first introduced by Agrawal and Srikant [50], where the execution order of the events remains unchanged. Any alteration of the execution order represents a deviation from a predefined sequence order, which may indicate anomalous events that are rare in a system. Rahman et al. [102] have shown that the sequence of ordered events is significant for detecting anomalies in SCADA logs using rare sequential pattern mining. However, that work focused only on the minimal or smallest rare sequential patterns and could not find all the rare sequential patterns. Therefore, we extend the previous work to find all rare sequential patterns and their equivalence classes in order to identify anomalies effectively from a sequential database. Although the proposed rare sequential pattern mining algorithm can effectively detect rare anomalous patterns, it is not applicable to finding anomalous patterns in time gapped control logs, that is, logs in which the system activities are not recorded continuously but only when an event happens on the system. The proposed method does not consider the time gap while generating rare sequential patterns. In addition, the proposed method cannot be used to predict possible anomalies in a live system.


3.7 Summary

In this chapter we have proposed and developed a new method for finding anomalies in SCADA control systems, based on a rare sequential pattern mining algorithm. In this thesis it is assumed that anomalies happen rarely in a system, so a rare pattern could represent an anomalous pattern. As the activities of a SCADA system are limited and repetitive, an anomalous pattern on the SCADA system would be rare. In addition, the activities of a SCADA system occur sequentially, which has been the motivation to use rare sequential patterns to detect anomalies on a SCADA control system. We discussed a novel minimal and maximal rare sequential pattern mining approach for anomaly detection. To find effective patterns, the rare patterns are put into different groups such that all patterns in a group share the same support value. The smallest pattern in each group is the minimal rare pattern, while the largest pattern is the maximal rare pattern. In each group, the minimal patterns, the maximal patterns, and the patterns in between could be used to find anomalies, depending on the preference of the application domain.

We evaluated our method using Supervisory Control and Data Acquisition (SCADA) control system log data containing cyber-incidents. The identified rare anomalous sequences were intrusions or attacks on the system, demonstrating the usefulness of our rare sequential pattern mining approach. We applied our proposed methods to three SCADA control system log datasets. With every dataset it was found that some of the rare patterns identified as suspicious were later revealed to be intrusive patterns. The maximal rare patterns were more effective in identifying malicious anomalous patterns than the minimal rare patterns. However, depending on domain requirements, the minimal pattern or all rare patterns in a group could be effective in identifying anomalies.

The rare sequential pattern mining method discussed in this chapter finds rare patterns from logs that are recorded continuously on the SCADA control system. However, in some application domains events are recorded with an indefinite time-span gap: an event is recorded in the logs when it occurs, and otherwise nothing is recorded. Therefore, there exist time-span gaps among the events in the logs. As the proposed rare sequential pattern mining algorithm cannot detect anomalies from time-span gapped control logs, we propose another method to find rare patterns from such logs. In the next chapter, Chapter 4, we discuss a constraint-based rare sequential pattern mining technique. If a pattern is generated from two time windows separated by a time gap, it might not carry significance for finding anomalies. So, to find anomalies from significant rare patterns we need time constraint-based pattern mining. In addition, the proposed method can only detect anomalies from static SCADA control logs; it cannot be used to predict anomalies from the streaming logs of a live SCADA system. Therefore, we also propose a rare sequential association rules mining method, which uses SCADA streaming logs for anomaly prediction.

Chapter 4

Constraint-based Rare Sequential

Pattern Mining

4.1 Introduction

Pattern mining, and especially sequential pattern mining, produces overwhelming numbers of patterns. This makes the sequential mining process inefficient and ineffective. It becomes inefficient because the search space becomes exponential, and ineffective because knowledge has to be extracted from a large number of patterns. As a result, most of the patterns are useless to the users [113]. It is in fact the users who determine which patterns on a system are interesting and useful. They use different constraints or restrictions in the data source as well as in the mining process to discover the required useful patterns in an efficient and effective manner [114] [115] [116] [117]. In the literature, constraints have been accepted as the most common and effective approach to controlling the large number of discovered patterns. Many approaches to constraint-based pattern mining have been explored to achieve efficiency and effectiveness while discovering useful and interesting patterns. These approaches apply not only the semantics of the domain knowledge but also the interests of the system users.

Constraint-based pattern mining can be categorized into several groups based on application. Pei et al. [45] define seven categories of constraints: (i) Item constraint, (ii) Length constraint, (iii) Super-pattern constraint, (iv) Aggregate constraint, (v) Regular expression constraint, (vi) Duration constraint, and (vii) Gap constraint. All of these constraints can be applied with three mechanisms: (a) Pre-processing or dataset filtering constraints, (b) Pattern filtering or mining process constraints, and (c) Post-processing constraints [118] [76]. In the pre-processing step, constraints are applied at the data source to filter and organize the dataset so that the user-desired patterns can be obtained after applying the data mining methods. In the pattern filtering process, constraints are imposed by modifying the actual mining algorithm. This can make the data mining process more efficient by reducing the search space, so that less time is required to find the desired results. In the post-processing step, constraints are applied after the standard mining process has discovered the results. In this method, any number of constraints can be used to keep or extract the patterns the users demand and filter out the unwanted patterns. The post-processing method is unsatisfactory because it wastes computational time in producing patterns that are unwanted from the users' perspective and only then filters them out [116]. Therefore, this method does not improve the efficiency or performance of the data mining algorithm.

Rare sequential pattern mining is also challenging with respect to setting the threshold value. The number of rare sequential patterns can be large if the threshold is set to a higher value. In addition, if the average size of the sequences in the database SDB is large, the possibility of generating a large number of rare sequential patterns increases. Moreover, if the number of unique events in the SDB increases, the number of candidate sequential patterns also increases, and a large number of candidate sequential patterns takes more computational time in the data mining process. Furthermore, if the database size increases, meaning that the number of sequences in the SDB is large, the computational time of the mining process also increases. Therefore, these three factors, namely the size of the database, the size of the sequences in the database, and the number of unique events in the database, make it difficult to identify the rare suspicious anomalous patterns. These factors also increase the computational time significantly. Hence, there is a need to apply constraints to generate fewer rare sequential patterns, which reduces the computational time needed to identify rare suspicious anomalous patterns.


4.1.1 Motivation

We have analyzed the control logs of a real-life SCADA controlled electrical power distribution substation. We observed that when activities or events occur in the system, they are recorded in a log file, whereas when no events occur, nothing is recorded. As a result, the log file contains an episode of recorded events for a certain time-span period, followed by a time gap during which there are no recorded events. A meaningful or significant pattern can be generated from an episode of events in a defined time-span period. However, if a pattern extends beyond the defined consecutive time-span period, it may not be a significant pattern. This is because in an episode of events during the defined time-span period, some sequence of events accomplishes a complete task or process. Therefore, if a pattern is derived from several consecutive episodes of events and exceeds the defined time-span period, it is not a significant pattern but rather a misleading one.

Therefore, the first motivation is to integrate the time-span constraint while selecting the sequences from the data source, the raw logs. The time-span constraint is applied during the data pre-processing stage. As a result, significant patterns are extracted only from a defined time-span period. It means that the time duration between the first event and the last event in a pattern must satisfy the user-defined maximum time-span threshold value [119]. For example, consider the pattern 〈{a}, {c}, {d}〉, which appears to be rare (it occurs in SID2 and SID5) in the sequence database SDB shown in Table 4.1 when the maximum support threshold value maxsup is set to 2. It is difficult to distinguish between these two rare occurrences and to decide which one is more reliable to judge as a suspicious anomalous pattern.

Table 4.1: A sequential database SDB with events' occurrence time-stamp.

Sequence ID   Sequences
SID1          〈{c}_1, {b, d}_6〉
SID2          〈{a}_12, {e}_13, {c}_14, {f}_16, {d}_20〉
SID3          〈{b}_26, {f, g}_27〉
SID4          〈{g}_33, {d}_38〉
SID5          〈{a}_45, {e}_47, {c}_48, {d}_50〉

However, if we further analyze these two patterns considering the time-span duration, the pattern in SID5 (time-stamps 45 to 50) occurred within a defined time-span, say a 5-minute time-span duration, whereas the pattern in SID2 (time-stamps 12 to 20) did not. It is assumed that the SID5 pattern is more capable of harming the system than the pattern in SID2. In other words, the SID5 pattern is more significant than the SID2 pattern, because the events in the SID5 pattern are performed within the defined time-span period. A pattern has the potential to harm the system as long as it remains active within the defined time-span period. Otherwise, the pattern loses its potential or strength to harm the system and is hence considered a less significant pattern.
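The comparison of SID2 and SID5 can be checked with a short Python sketch. It is only an illustration under stated assumptions: the function name `min_time_span` and the threshold `MAX_SPAN = 5` are hypothetical, the time-stamps of Table 4.1 are treated as plain numbers in one common unit, and the check is applied to occurrences of a given pattern, whereas the thesis applies the time-span constraint when sequences are selected from the raw logs during pre-processing.

```python
def min_time_span(sequence, pattern):
    """Smallest (last - first) time-stamp difference over all occurrences of
    `pattern` in `sequence`, or None if the pattern does not occur.

    `sequence` is a list of (itemset, timestamp) pairs in time order and
    `pattern` is a non-empty list of itemsets.
    """
    best = None
    for start, (itemset, t_first) in enumerate(sequence):
        if not pattern[0] <= itemset:
            continue
        pos, t_last = 1, t_first
        # For a fixed first event, greedy matching gives the earliest end.
        for later_itemset, t in sequence[start + 1:]:
            if pos == len(pattern):
                break
            if pattern[pos] <= later_itemset:
                pos, t_last = pos + 1, t
        if pos == len(pattern):
            span = t_last - t_first
            best = span if best is None else min(best, span)
    return best


# The sequence database of Table 4.1 as (itemset, timestamp) pairs.
sdb = {
    "SID1": [({"c"}, 1), ({"b", "d"}, 6)],
    "SID2": [({"a"}, 12), ({"e"}, 13), ({"c"}, 14), ({"f"}, 16), ({"d"}, 20)],
    "SID3": [({"b"}, 26), ({"f", "g"}, 27)],
    "SID4": [({"g"}, 33), ({"d"}, 38)],
    "SID5": [({"a"}, 45), ({"e"}, 47), ({"c"}, 48), ({"d"}, 50)],
}
pattern = [{"a"}, {"c"}, {"d"}]
MAX_SPAN = 5  # hypothetical user-defined maximum time-span threshold

spans = {}
for sid, sequence in sdb.items():
    span = min_time_span(sequence, pattern)
    if span is not None:
        spans[sid] = span

within = [sid for sid, span in spans.items() if span <= MAX_SPAN]
print(spans)   # time-span of the pattern in each sequence that contains it
print(within)  # sequences in which the pattern satisfies the threshold
```

With these inputs the pattern spans 8 time units in SID2 and 5 in SID5, so only the SID5 occurrence satisfies a threshold of 5.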

The second motivation is to avoid unnecessary database scanning while generating rare sequential patterns. This can be achieved by integrating a constraint into the actual rare sequential pattern mining algorithm. The constraint, which is also called the algorithmic constraint, prevents searching for a candidate sequence whose size is larger than the size of a sequence in the database. For example, assume that 〈{c}, {b}, {a}, {c}〉 is a candidate sequence and 〈{b, c}, {b}, {a}〉 is a sequence in the database. The algorithmic constraint compares the size of the candidate sequence with the size of the sequence in the database. If the candidate sequence is larger than the database sequence, the candidate sequence cannot be found in it. In this example, the size of the sequence 〈{b, c}, {b}, {a}〉, which is 3, is smaller than the size of the candidate sequence 〈{c}, {b}, {a}, {c}〉, which is 4. So, the candidate sequence cannot be found in the database sequence. This is because a candidate sequence must be a subsequence of any database sequence that contains it, and a subsequence cannot be larger than the sequence itself. Hence, there is no benefit in looking for the candidate sequence in this database sequence. Therefore, the algorithmic constraint can reduce the computational time of the rare sequential pattern mining process.
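A minimal Python sketch of this size check follows. It is an illustration rather than the thesis's algorithm: the function names are hypothetical, the size of a sequence is taken to be its number of itemsets (as in the example above), and the second database sequence is invented so that the candidate has one supporting sequence.

```python
def contains(sequence, pattern):
    """Return True if `pattern` is an order-preserving subsequence of `sequence`."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def support_with_size_check(sdb, candidate):
    """Count supporting sequences, skipping every sequence that is shorter
    than the candidate. Returns (support, number_of_skipped_sequences)."""
    support, skipped = 0, 0
    for sequence in sdb:
        if len(candidate) > len(sequence):
            skipped += 1  # the candidate cannot fit, so no scan is needed
            continue
        if contains(sequence, candidate):
            support += 1
    return support, skipped


candidate = [{"c"}, {"b"}, {"a"}, {"c"}]       # size 4, from the text
sdb = [
    [{"b", "c"}, {"b"}, {"a"}],                # size 3, from the text
    [{"c"}, {"b"}, {"a"}, {"d"}, {"c"}],       # size 5, hypothetical
]

support, skipped = support_with_size_check(sdb, candidate)
print(support, skipped)
```

The length comparison is a constant-time test, so each skipped sequence saves one full subsequence scan.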

The third motivation for integrating constraints into rare sequential pattern mining is to reduce the number of features. The feature reduction constraint reduces the number of unique events in the database. Fewer unique events reduce the computation time and search space while mining the rare sequential patterns from the database. For example, assume a rare sequential pattern 〈{a}, {b}, {c}〉 of size 3 and a sequence database SDB that has 38 unique events. When all possible candidate super sequential patterns of size 4 are generated from a size-3 rare sequential pattern, a total of 4 ∗ 38 = 152 candidate super sequential patterns is generated. In addition, if there exist 20 rare sequential patterns, the total number of candidate super sequential patterns generated is 20 ∗ 4 ∗ 38 = 3040. In general, N rare sequential patterns of size n with M unique events in a database generate N ∗ (n+1) ∗ M candidate super sequential patterns of size (n+1). Furthermore, if the average size of the sequences in the database SDB is large, the number of candidate super sequential patterns generated level-wise (as the sequence size increases) can be large. Therefore, we argue that the generation of candidate super sequential patterns can be reduced by keeping the average sequence size small and reducing the number of unique events in the database. The number of unique events and the average size of the sequences can be reduced by integrating a feature reduction constraint into the database.
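The counting argument can be reproduced with a few lines of Python. This is a sketch under an assumption: candidates are formed by inserting one event as a new element at each of the n+1 positions of a pattern, which matches the count N ∗ (n+1) ∗ M but may differ from the exact generation step of the thesis's algorithm. The function names and the event names e0 to e34 are hypothetical.

```python
def candidate_count(num_patterns, pattern_size, num_events):
    """Number of size-(n+1) candidates: N * (n + 1) * M."""
    return num_patterns * (pattern_size + 1) * num_events


def extend(pattern, events):
    """Insert every event at every one of the n+1 positions of `pattern`.

    Insertions are counted individually, so an identical result reached by
    two different insertions appears twice in the returned list.
    """
    return [
        pattern[:i] + (event,) + pattern[i:]
        for i in range(len(pattern) + 1)
        for event in events
    ]


# 38 unique events: a, b, c plus 35 hypothetical events e0 ... e34.
events = ["a", "b", "c"] + [f"e{k}" for k in range(35)]
pattern = ("a", "b", "c")

print(candidate_count(1, 3, 38))     # one size-3 pattern
print(candidate_count(20, 3, 38))    # twenty size-3 patterns
print(len(extend(pattern, events)))  # insertions for one pattern
print(candidate_count(1, 3, 19))     # same pattern after halving the events
```

Because the count is linear in the number of unique events, halving the events from 38 to 19 halves the number of candidates for each pattern.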

4.2 Existing Related Work

To understand how constraints are used in pattern mining, in this section we review the existing work on constraint-based pattern mining. In sequential pattern mining, the use of constraints was first introduced by Srikant and Agrawal [66]. They introduced time constraints, such as a minimum gap between two successive events, a maximum gap between two consecutive events, and a sliding time window, which relax the conventional sequential pattern mining process in GSP with the Apriori framework. Later, many methods were proposed in constraint-based pattern mining to achieve effectiveness and efficiency in mining the sequential patterns of interest to the users. Garofalakis et al. [120] proposed a family of four algorithms called SPIRIT, in which regular expressions R are used as constraints for mining frequent sequential patterns that satisfy a given regular expression constraint. For example, SPIRIT(N) only keeps the candidate sequence patterns whose elements are defined by the constraint R. In other words, candidate sequence patterns are pruned when they do not satisfy constraint R. The constraints are enforced inside the mining process.

Parthasarathy et al. [121] use constraints in post-processing. Zaki et al. [77] integrate a variety of syntactic constraints into the cSPADE algorithm to mine frequent sequences. These constraints are length or width restrictions, gap limitations on consecutive events in a sequence, a time window restriction on the occurrence of a whole sequence, and item constraints limiting the inclusion or exclusion of defined items in a sequence. The authors imposed these constraints inside the mining process. Desai and Ganatra [122] effectively apply different constraints such as Gap, Compactness (Time span), Item, Recency, Profitability and Length to understand the purchasing behavior of customers. Antunes and Oliveira [123] introduced a gap constraint in a generalization of the PrefixSpan algorithm (GenPrefixSpan). They have shown that the gap constraint is applicable to long sequences such as bioinformatics sequences [124]. The literature shows that the computational processing time of the pattern mining process can be reduced significantly by applying constraints [74] [125]. It has also been shown that constraint-based pattern mining can effectively reduce a large search space when applied in sequential pattern mining [45].

The above discussion shows that all of the constraints studied in the literature are applied to mining frequent patterns. However, no prior work applies constraints to mine rare sequential patterns by focusing on the users' interests and the semantics of the SCADA domain. To improve the efficiency and effectiveness of the proposed rare sequential pattern mining method, we integrate constraints into the rare sequential pattern mining approach.

4.3 Preliminaries

This section presents the background on the constraints that are integrated into the proposed rare sequential pattern mining algorithm. From an application point of view, these constraints are widely used to produce only the patterns that interest the users and to discard unwanted patterns. Some of the constraint concepts are used in the literature, for example in [126] [122], where they are applied to frequent sequential pattern mining. In contrast, we integrate them into our proposed rare sequential pattern mining algorithm. The objective of constraint-based rare sequential pattern mining is to ensure that the important patterns are identified and the unwanted patterns are ignored. Let I be a set of items. According to Pei et al. [72], a constraint C is defined as a predicate on the powerset of I, that is, $C : 2^{I} \rightarrow \{\text{true}, \text{false}\}$. A sequence S satisfies a constraint C if and only if C(S) is true. The problem of constrained rare sequential pattern mining is to find all rare patterns in a sequence database SDB that satisfy the constraint, that is, all patterns S for which $(sup(S) \leq \sigma) \wedge (C(S) = \text{true})$, where $\sigma$ is the maximum support threshold value maxsup.

This research integrates the following constraints into the rare sequential pattern mining process described in Chapter 3 to achieve the three goals described in Section 4.1.1.

Constraint 1 (Time-span duration): The time-span constraint is defined on the difference between the timestamps of the first and the last events in a discovered sequential pattern. This is similar to the approach used by Zhu et al. [127], where the authors apply session filters to mine web sequential patterns. The duration must be within the given time period. Let $S = \langle A_1, A_2, \ldots, A_n \rangle$ be a sequence, $A_i.time$ be the timestamp of $A_i$, and $Dur(S) = A_n.time - A_1.time$ be the duration of S. The time-span duration constraint is defined as:

$C_{TS}(S) \equiv Dur(S) \leq \Delta t$, where $\Delta t$ is an integer.

For example, the pattern 〈{a}, {c}, {d}〉, which appears in SID2 and SID5 in Table 4.1, is a rare sequential pattern when the maximum support threshold value maxsup is set to 2. If the time-span constraint is $\Delta t = 5$, only the occurrence in SID5 is a valid rare sequential pattern. The time difference between the last event {d} and the first event {a} of the pattern in SID5 is $5 - 1 = 4$, which is within the defined time-span constraint $\Delta t = 5$. The same pattern in SID2 is not a valid rare sequential pattern, because its time-span $10 - 1 = 9$ exceeds $\Delta t = 5$.
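The time-span check can be illustrated with a minimal Python sketch. It assumes that each occurrence of a pattern is available as the ordered list of timestamps of its matched events. The function name `satisfies_time_span` and the middle timestamps are illustrative assumptions; only the first and last timestamps come from the example above.

```python
def satisfies_time_span(event_times, delta_t):
    """Return True when Dur(S) = last timestamp - first timestamp <= delta_t."""
    if not event_times:
        return False  # an empty occurrence has no duration to test
    duration = event_times[-1] - event_times[0]
    return duration <= delta_t


# First and last timestamps follow the example in the text; the middle
# timestamps are assumed, since only the end points affect Dur(S).
occurrence_sid5 = [1, 3, 5]    # <{a}, {c}, {d}> in SID5: 5 - 1 = 4
occurrence_sid2 = [1, 4, 10]   # <{a}, {c}, {d}> in SID2: 10 - 1 = 9

print(satisfies_time_span(occurrence_sid5, delta_t=5))  # True
print(satisfies_time_span(occurrence_sid2, delta_t=5))  # False
```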

Constraint 2 (Pattern size): The pattern size constraint is defined by comparing the size of a candidate sequence with the size of a sequence in a sequential database SDB. Let $\alpha$ and $\gamma$ be two sequences, where $\alpha$ is a candidate sequence and $\gamma$ is a sequence in SDB. The pattern size constraint is defined as:

$C_{size}(\alpha, \gamma) \equiv Size(\alpha) \leq Size(\gamma)$, where $Size(\alpha)$ and $Size(\gamma)$ are integers.

For example, the size-3 candidate sequence 〈{a}, {c}, {d}〉 can only be found in those sequences of the SDB in Table 4.1 whose size is larger than or equal to 3. It can therefore only be found in SID2 and SID5, whose sizes (size-5 and size-4, respectively) are larger than the size of the candidate sequence. It cannot be found in SID1, SID3 and SID4, because each of these sequences is of size-2, which is smaller than the size of the candidate sequence.
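As a small illustration, the size check can be written as a filter over the sequence sizes quoted in the example. This is only a sketch: the function name `size_ok` and the dictionary layout are assumptions, while the sizes are the ones stated for Table 4.1.

```python
def size_ok(candidate_size, sequence_size):
    """C_size: a candidate can only occur in a sequence that is at least as long."""
    return candidate_size <= sequence_size


# Sequence sizes as stated in the example for Table 4.1.
sequence_sizes = {"SID1": 2, "SID2": 5, "SID3": 2, "SID4": 2, "SID5": 4}

candidate_size = 3  # size of <{a}, {c}, {d}>
to_scan = [sid for sid, size in sequence_sizes.items()
           if size_ok(candidate_size, size)]
print(to_scan)  # ['SID2', 'SID5']
```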

Constraint 3 (Pattern existence): The pattern existence constraint is defined by comparing the support of a candidate sequence with the maximum support threshold value maxsup. Let $\alpha$ be a candidate sequence and maxsup be the maximum support threshold value for finding rare sequential patterns in a sequential database. The pattern existence constraint is defined as:

$C_{PE}(\alpha) \equiv Sup(\alpha) = maxsup$, where $Sup(\alpha)$ and $maxsup$ are integers.

For example, let 〈{a}, {c}, {f}〉 be a candidate sequence and let the maximum support threshold value maxsup be 1. While the database is scanned, the support of the candidate sequence reaches 1, and therefore equals maxsup, at sequence SID2 in Table 4.1. It is then unnecessary to scan the remaining sequences of the database SDB, so the scanning of SID3, SID4 and SID5 can be avoided.
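The early stop can be sketched in Python as follows. The database below is a hypothetical stand-in for Table 4.1, chosen only to agree with the facts quoted in the text, and each itemset is simplified to a single event. The helper names are assumptions.

```python
def is_subsequence(pattern, sequence):
    """True when the events of `pattern` occur in `sequence` in the same order."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def count_support_early_stop(pattern, sdb, maxsup):
    """Count support, stopping the scan once the support reaches maxsup."""
    support = 0
    scanned = 0
    for sequence in sdb:
        scanned += 1
        if is_subsequence(pattern, sequence):
            support += 1
            if support == maxsup:  # C_PE holds: no further scanning needed
                break
    return support, scanned


# Hypothetical stand-in for Table 4.1 (assumed contents, one event per itemset).
sdb = [
    ["a", "b"],                 # SID1
    ["a", "c", "d", "e", "f"],  # SID2
    ["b", "c"],                 # SID3
    ["a", "e"],                 # SID4
    ["a", "c", "d", "b"],       # SID5
]

support, scanned = count_support_early_stop(["a", "c", "f"], sdb, maxsup=1)
print(support, scanned)  # 1 2
```

Only SID1 and SID2 are read before the scan stops, which corresponds to skipping SID3, SID4 and SID5 in the example.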

Constraint 4 (Feature reduction): In this process, the important features are selected and the unimportant features are dropped during the data pre-processing stage. An unimportant feature is a feature whose value cannot be changed or which is not required to conduct attacks on the system. We formulated feature selection rules to select the important features and to drop the insignificant ones while identifying anomalies. The features that carry the significant information needed to identify anomalous patterns are selected for the experiment, that is, the features whose values can be changed to conduct attacks on the control system. For example, an attack can be conducted by changing the value of the feature Tank_Read_Pump_In_Manual to operate the water tank control system in manual mode rather than in automatic mode.

4.4 Constraint-based Rare Sequential Pattern Mining Algorithm

In this section, we discuss how the constraints are integrated by modifying the rare sequential pattern mining algorithms (Algorithm 3.1 and Algorithm 3.2) presented in Chapter 3. Algorithm 3.1, which generates rare sequential generator patterns, cannot be modified to integrate the algorithmic pattern existence constraint. The pattern existence constraint stops the scan of the database once the support of a candidate sequence pattern reaches the maximum support threshold value maxsup. To find rare sequential generator patterns, however, it is necessary to determine whether each candidate sequence pattern is rare or frequent, and a candidate may be a frequent pattern whose support value exceeds maxsup. Stopping the scan at maxsup would hide this distinction. Therefore, it is not possible to integrate the pattern existence constraint during the generation of the rare sequential generators.

However, the other algorithmic constraint, the pattern size constraint, can be integrated while generating rare sequential generators with Algorithm 3.1. This is because a candidate sequence may be larger than a sequence in the database SDB. Therefore, the pattern size constraint is used to generate the constraint-based rare sequential generators, as given in Algorithm 4.1. In addition, the constraint-based rare sequential pattern mining algorithm given in Algorithm 4.2 integrates both algorithmic constraints: the pattern size constraint and the pattern existence constraint. These two constraints are integrated by modifying the rare sequential pattern mining algorithm (Algorithm 3.2) described in Chapter 3.

The goal of the constraint-based rare sequential pattern mining algorithms (Algorithm 4.1 and Algorithm 4.2) is to find the rare sequential patterns that interest the user and to remove the unwanted patterns. As a result, anomalies are identified from a reduced number of rare sequential patterns, which makes anomaly detection more effective: anomalies are sought in a small number of rare sequential patterns instead of a large number. In addition, the constraint-based algorithms find anomalies more efficiently because they reduce the computation time needed to generate rare sequential patterns. This is achieved by generating only the rare sequential patterns that interest the user instead of all rare patterns. Further, the constraint-based algorithms provide the system operators with fewer rare sequential patterns to inspect when identifying anomalies. Having fewer rare patterns also reduces the false positives, because the anomalies are identified from fewer rare patterns.

4.4.1 Generating Constrained Rare Sequential Generator Patterns

The algorithm for generating constrained rare sequential generators (Algorithm 4.1) is a modified version of Algorithm 3.1 discussed in Chapter 3. Algorithm 4.1 generates minimal rare sequential generators from a sequential database SDB. First, the size-1 generators are found. The size-1 events in Events(SDB), also called the candidate sequence generators CSG_1, are separated into a rare zone and a frequent zone based on their support values. A candidate in CSG_1 is placed in the rare zone when its support value is below or equal to the maximum support threshold value maxsup, and in the frequent zone when its support value is larger than maxsup. The candidates in the rare zone are called size-1 rare generators, while those in the frequent zone are called size-1 frequent generators. This separation is done in steps 1-12 of Algorithm 4.1.
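The separation of size-1 events into rare and frequent zones can be sketched in Python as follows. The database is a hypothetical stand-in for Table 4.1 with one event per itemset, and the function name `partition_size1` is an assumption. The sketch follows steps 1-12 of Algorithm 4.1, including the removal of events whose support equals |SDB|.

```python
from collections import Counter


def partition_size1(sdb, maxsup):
    """Split size-1 events into rare and frequent generators by support."""
    # Support counts the sequences that contain an event, not its occurrences.
    support = Counter(event for sequence in sdb for event in set(sequence))
    rare, frequent = set(), set()
    for event, supp in support.items():
        if supp == len(sdb):
            continue  # steps 6-7: events present in every sequence are removed
        if supp > maxsup:
            frequent.add(event)   # frequent zone
        else:
            rare.add(event)       # rare zone
    return rare, frequent


# Hypothetical stand-in for Table 4.1 (assumed contents).
sdb = [
    ["a", "b"],                 # SID1
    ["a", "c", "d", "e", "f"],  # SID2
    ["b", "c"],                 # SID3
    ["a", "e"],                 # SID4
    ["a", "c", "d", "b"],       # SID5
]

rare, frequent = partition_size1(sdb, maxsup=2)
print(sorted(rare), sorted(frequent))  # ['d', 'e', 'f'] ['a', 'b', 'c']
```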

Algorithm 4.1: Generating Constraint-based Rare Sequential Generator Patterns
Input: A sequential database SDB, maxsup
Output: Constrained Rare Sequential Generator Patterns (RSG)

 1  CSG_1 ← {〈e〉 | ∀e ∈ Events(SDB)}   // CSG_1 is the set of size-1 candidate sequence generators
 2  FSG_1 ← {}, RSG_1 ← {}   // FSG_1 and RSG_1 are the sets of frequent and rare sequential generators, respectively
 3  S.supp_0 ← |SDB|, ∀S ∈ CSG_1
 4  Count support S.supp_1 of each sequence S in CSG_1 by scanning SDB
 5  for S ∈ CSG_1 do
 6      if S.supp_1 = S.supp_0 then
 7          remove S from CSG_1
 8      else
 9          if S.supp_1 > maxsup then
10              FSG_1 ← FSG_1 ∪ {S}
11          else
12              RSG_1 ← RSG_1 ∪ {S}
13  s ← 2
14  FSG_s ← {}, RSG_s ← {}
15  while FSG_{s−1} not empty do
16      CSG_s ← all possible combinations of two sequences with common prefix of size-(s−2) subsequences in FSG_{s−1}
17      for S ∈ CSG_s do
18          ms ← minimum support of the size-(s−1) subsequences of S
19          S.supp_s ← 0
20          for a ∈ SDB and C_size(S, a) is true do
21              if S ⊑ a then
22                  S.supp_s ← S.supp_s + 1
23              else
24                  continue
25          if S.supp_s = ms then
26              remove S from CSG_s
27          else
28              if S.supp_s > maxsup then
29                  FSG_s ← FSG_s ∪ {S}
30              else
31                  RSG_s ← RSG_s ∪ {S}
32      s ← s + 1
33  return RSG = RSG_1 ∪ RSG_2 ∪ ... ∪ RSG_{s−1}

After generating the size-1 rare generators, we need to find the rare generators of larger size. These larger generators are generated by merging frequent generators, starting from the size-1 frequent generators. Frequent generators of size-s are merged to generate size-(s+1) candidate sequence generators. A candidate sequence generator of size-(s+1) is generated by merging two size-s generators that share a common prefix subsequence of size-(s−1). For example, 〈{a}, {b}〉 and 〈{a}, {c}〉 are two frequent generators of size-2 with the common prefix subsequence 〈{a}〉. Two candidate sequence generators of size-3 are generated from them: keeping the common prefix subsequence unchanged, the remaining suffixes are merged in both forward and reverse order. The resulting candidate sequence generators are 〈{a}, {b}, {c}〉 and 〈{a}, {c}, {b}〉, which are generated at step 16 of Algorithm 4.1.
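The merging of two frequent generators that share a prefix can be sketched in Python. Sequences are represented as tuples with one event per itemset, which is a simplification, and the function name `merge_generators` is an assumption. The sketch does not treat the case of merging a generator with itself, which the text does not specify.

```python
def merge_generators(first, second):
    """Merge two size-s sequences sharing a size-(s-1) prefix into size-(s+1) candidates."""
    if len(first) != len(second) or first[:-1] != second[:-1]:
        return []  # the two sequences have no common prefix of size s-1
    prefix = first[:-1]
    # Keep the common prefix and append the two suffix events in both orders.
    return [prefix + (first[-1], second[-1]),
            prefix + (second[-1], first[-1])]


print(merge_generators(("a", "b"), ("a", "c")))
# [('a', 'b', 'c'), ('a', 'c', 'b')]
```

For size-1 generators the common prefix is empty, so `merge_generators(("a",), ("b",))` gives the two size-2 candidates `('a', 'b')` and `('b', 'a')`.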

Once the candidate sequence generators are generated, the database is scanned to check whether each candidate sequence exists in it. During this search, the pattern size constraint is applied, as shown at steps 20-24 of Algorithm 4.1. The pattern size constraint was not used in the rare sequential pattern mining Algorithm 3.1. This constraint enables Algorithm 4.1 to skip the sequences in the SDB that need not be scanned. If the size of a candidate sequence is larger than the size of an SDB sequence, that sequence is skipped and the scan moves to the next sequence. For example, when looking for the candidate sequence 〈{a}, {e}, {f}〉 in the SDB shown in Table 4.1, the sequences SID1, SID3 and SID4 are skipped, because the size of the candidate sequence, size-3, is larger than the size of these three sequences, which is size-2. Only SID2 and SID5 are searched for the candidate sequence. Since 3 of the 5 sequences in the database are skipped, up to 60% of the sequence scans, and hence of the computational time for finding this candidate sequence, can be saved.
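The effect of the pattern size constraint on the scan can be sketched in Python. The database is again a hypothetical stand-in for Table 4.1 with one event per itemset, chosen to match the sequence sizes quoted in the text, and the helper names are assumptions.

```python
def is_subsequence(pattern, sequence):
    """True when the events of `pattern` occur in `sequence` in the same order."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def count_support_with_size_constraint(pattern, sdb):
    """Count support while skipping sequences shorter than the candidate."""
    support = 0
    skipped = 0
    for sequence in sdb:
        if len(pattern) > len(sequence):  # C_size fails: skip without scanning
            skipped += 1
            continue
        if is_subsequence(pattern, sequence):
            support += 1
    return support, skipped


# Hypothetical stand-in for Table 4.1 (assumed contents).
sdb = [
    ["a", "b"],                 # SID1, size-2
    ["a", "c", "d", "e", "f"],  # SID2, size-5
    ["b", "c"],                 # SID3, size-2
    ["a", "e"],                 # SID4, size-2
    ["a", "c", "d", "b"],       # SID5, size-4
]

support, skipped = count_support_with_size_constraint(["a", "e", "f"], sdb)
print(support, skipped, skipped / len(sdb))  # 1 3 0.6
```

The saving depends on the data: it is largest when many sequences are shorter than the candidate, and it is zero when all sequences are at least as long as the candidate.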

A candidate sequence whose size does not exceed the size of a sequence in the database can be found either as a rare sequential generator or as a frequent sequential generator, depending on its support value. This is done at steps 25-31 of Algorithm 4.1. The entire process of finding the rare sequential generators, shown in steps 15-32, continues until there are no more frequent sequential generators to process, that is, until FSG_s becomes empty. At the end of the process, the constrained rare sequential generators are collected, as shown at step 33 of Algorithm 4.1.

4.4.2 Generating Constrained Rare Sequential Patterns

In this phase, constrained rare sequential patterns are generated from the constrained rare sequential generators produced by Algorithm 4.1. The proposed method is described in Algorithm 4.2, which is a modified version of Algorithm 3.2 described in Chapter 3. The procedure starts with the size-1 rare sequential generators, as shown at steps 3-4 of Algorithm 4.2. Starting from the size-1 rare generators, at each size-s all possible candidate rare sequential patterns CRSP_{s+1} of size-(s+1) are generated, as shown at steps 6-11 of Algorithm 4.2. For each rare sequential pattern of size-s, the candidate sequence patterns are generated by extending the rare sequential pattern with each size-1 event of the database. Each event is placed in every possible position of the rare sequential pattern. For example, let 〈{a}, {b}, {c}〉 be a rare sequential pattern of size-3 and {g} a size-1 event. Four candidate sequential patterns are generated from the size-3 rare sequential pattern by placing the event {g} in the four possible positions: 〈{g}, {a}, {b}, {c}〉, 〈{a}, {g}, {b}, {c}〉, 〈{a}, {b}, {g}, {c}〉 and 〈{a}, {b}, {c}, {g}〉.
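The extension step can be sketched in a few lines of Python. Patterns are represented as tuples with one event per itemset, which is a simplification, and the function name `extend_pattern` is an assumption.

```python
def extend_pattern(pattern, event):
    """Insert `event` at every position of `pattern`, giving size-(s+1) candidates."""
    return [pattern[:i] + (event,) + pattern[i:]
            for i in range(len(pattern) + 1)]


for candidate in extend_pattern(("a", "b", "c"), "g"):
    print(candidate)
# ('g', 'a', 'b', 'c')
# ('a', 'g', 'b', 'c')
# ('a', 'b', 'g', 'c')
# ('a', 'b', 'c', 'g')
```

When the inserted event already occurs in the pattern, some positions give the same candidate twice, so collecting the results in a set, as the set union in Algorithm 4.2 suggests, removes the duplicates.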

Not all of the generated candidate sequential patterns are rare sequential patterns: some are rare sequential patterns and the others are non-existent patterns. Because the candidates are generated from rare sequential patterns, each candidate must be infrequent, that is, either rare or non-existent. While the candidate rare sequential patterns CRSP_{s+1} are processed, the candidates that contain any non-existent pattern are excluded, because any candidate generated from a non-existent pattern is itself a non-existent pattern. The remaining candidates are checked further against the database to find the rare sequential patterns. If a candidate in CRSP_{s+1} is found in the database, it is added to the rare sequential patterns RSP. If it is not found, it is added to the set of non-existent patterns NEP so that no subsequent candidates are generated from it. This is shown at steps 13-26 of Algorithm 4.2.

Algorithm 4.2: Generating Constrained Rare Sequential Patterns and their Equivalence Classes.
Input: a sequential database SDB, a set of rare sequential generators RSG
Output: Generating all rare patterns and their equivalence classes

 1  NEP ← {}   // NEP holds all non-existent patterns in SDB
 2  GRSP ← {}   // set of equivalence classes that have the same support and occur in the same sequences in SDB
 3  s ← 1
 4  RSP_s ← {g | g ∈ RSG, |g| = 1}   // size-1 rare generators
 5  ms ← max_{S ∈ SDB} {|S|}
 6  while s < ms and RSP_s ≠ empty do
 7      CRSP_{s+1} ← {}   // candidate rare sequence patterns of size-(s+1)
 8      for each S in RSP_s do
 9          for each e in Events(SDB) do
10              C ← all sequences generated by adding e into S at different positions
11              CRSP_{s+1} ← CRSP_{s+1} ∪ C
12      RSP_{s+1} ← {}
13      for each S in CRSP_{s+1} do
14          if there is n in NEP such that n is a subsequence of S then
15              continue
16          else
17              RSP_{s+1} ← RSP_{s+1} ∪ {S}
18      RSP_{s+1} ← RSP_{s+1} ∪ RSG_{s+1}
19      for each S in RSP_{s+1} do
20          S.supp ← 0, S.sid ← {}
21          for a ∈ SDB and C_size(S, a) is true do
22              if S ⊑ a and C_PE(S) is true then
23                  S.supp ← S.supp + 1
24                  S.sid ← S.sid ∪ {a.sid}   // a.sid is the id of sequence a
25              else
26                  NEP ← NEP ∪ {S}
27          sp ← S.supp, sid ← S.sid
28          if GRSP_{sp,sid} is in GRSP then
29              GRSP_{sp,sid} ← GRSP_{sp,sid} ∪ {S}
30          else
31              GRSP_{sp,sid} ← {S}
32              GRSP ← GRSP ∪ {GRSP_{sp,sid}}
33      s ← s + 1
34      RSP_s ← RSP_s ∪ RSG_s
35  return RSP = RSP_1 ∪ RSP_2 ∪ ... ∪ RSP_{s−1}
36  return GRSP

The two algorithmic constraints, the pattern size constraint and the pattern existence constraint, are applied while searching for the candidate rare sequential patterns in the database, as shown at steps 21-26 of Algorithm 4.2. The pattern size constraint checks whether the size of a candidate rare sequential pattern in CRSP_{s+1} is larger than the size of a sequence in the database. If it is, the candidate cannot be found in that sequence, so the scan of that sequence is skipped. The pattern existence constraint stops scanning the database once the support of a candidate rare sequential pattern equals the maximum support threshold value maxsup, as shown at step 22 of Algorithm 4.2. When the constrained rare sequential patterns RSP_{s+1} are found, they are separated into equivalence classes. Each equivalence class holds the rare patterns that have the same support value and occur in the same sequences of the database, as shown at steps 28-32 of Algorithm 4.2.
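The grouping into equivalence classes can be sketched in Python. The sketch assumes that each rare pattern has already been mapped to the set of sequence ids in which it occurs. The pattern-to-id mapping below is hypothetical, and the function name `group_equivalence_classes` is an assumption.

```python
from collections import defaultdict


def group_equivalence_classes(pattern_sids):
    """Group patterns that share the same support and the same sequence ids."""
    classes = defaultdict(list)
    for pattern, sids in pattern_sids.items():
        key = (len(sids), frozenset(sids))  # (support, ids of containing sequences)
        classes[key].append(pattern)
    return dict(classes)


# Hypothetical rare patterns with the ids of the sequences that contain them.
pattern_sids = {
    ("a", "c", "d"): {"SID2", "SID5"},
    ("c", "d"): {"SID2", "SID5"},
    ("a", "c", "f"): {"SID2"},
}

classes = group_equivalence_classes(pattern_sids)
for (supp, sids), members in classes.items():
    print(supp, sorted(sids), members)
```

In this sketch the support is derived from the id set, so the id set alone would be enough as a key. The pair is kept to mirror the GRSP_{sp,sid} indexing used in Algorithm 4.2.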

4.5 Experimental Evaluation

In this section, we present the experimental methodology used to evaluate the proposed constraint-based rare sequential pattern mining algorithm. First, we describe the dataset used in the experiment. Second, we explain the data pre-processing steps that prepare the dataset for the proposed algorithm. Finally, we describe the method used in the experiment.

4.5.1 Dataset

This experiment also uses off-line SCADA control logs. Three control systems were used as the source of the logs, which contain data about the process activities of the control systems. The features that record the activities in the logs hold binary and integer values. For example, the feature Conv_Read_Conv_Color_PE shown in Table 4.2 indicates the detected color of the puck running on the conveyor belt. Depending on the color of the puck, the value of Conv_Read_Conv_Color_PE changes from 0 to −1 or vice versa, and the diverting paddle then directs the puck either to the left or to the right of the conveyor belt. If the direction is left, the value of the feature Conv_Read_Solenoid_Left_Direction changes from 0 to −1. If the direction is right, the value of the feature Conv_Read_Solenoid_Right_Direction changes from −1 to 0.

Table 4.2: A partial view of the conveyor belt control logs.

VarName                               TimeString               VarValue
Conv_Read_Solenoid_Left_Direction     16/06/2017 5:41:08 PM    -1
Conv_Read_Solenoid_Right_Direction    16/06/2017 5:41:08 PM    0
Conv_Run_Status                       16/06/2017 5:41:08 PM    0
Conv_Read_Conv_Color_PE               16/06/2017 5:41:08 PM    0
Conv_Read_Conv_HMI_Direction          16/06/2017 5:41:08 PM    0
Conv_Read_Conv_Present_PE             16/06/2017 5:41:08 PM    0
HMI_Conv_Master_Mode                  16/06/2017 5:41:08 PM    -1
HMI_Conv_Reset                        16/06/2017 5:41:08 PM    0
HMI_Conv_Direction                    16/06/2017 5:41:08 PM    0

All feature values in the conveyor belt control logs are binary, that is, either 0 or −1. Features that are not binary hold integer or floating point values. For example, the pressure value of the pressure control system changes from low to high when the pressure increases in the system's pipe, and from high to low when the pressure is released from the pipe. The pressure values are stored as floating point values because the pressure status cannot be indicated by binary values. These floating point pressure values have high variance, which increases the number of unique events in the database.

The pressure control system feature Pipe_Read_Pipeline_Pressure shown in Table 4.3 holds the floating point value 17.63478, which indicates the current pressure in the pipe of the pressure control system. The pressure value changes gradually from low to high and from high to low as the pressure in the pipe increases and decreases, respectively. In a SCADA control process, a change of a value away from its predefined set value is considered an attack on the control system, because it can disrupt the process. Changing an event's value or a threshold value can alter the process outcome and make the event a rare event. For example, changing the upper threshold value of the pressure control event Solenoid_On_SP_Int from 40 PSI to 50 PSI makes it a rare anomalous event that can hamper the outcome of the process. This change of the predefined pressure set value is an attack on the pressure control system because it violates the normal process activities. All of the pressure values are recorded as floating point values under the pressure variable. Further, in the water tank control system, the feature Tank_Read_Tank_Level shown in Table 4.4 holds the floating point value 61.77073, which indicates the current water level in the tank.

Table 4.3: A partial view of the pressure control logs.

VarName                        TimeString               VarValue
HMI_Pipe_Pump_Off_SP           16/06/2017 6:55:08 PM    40
HMI_Pipe_Pump_On_SP            16/06/2017 6:55:08 PM    5
HMI_Pipe_Solenoid_Off_SP       16/06/2017 6:55:08 PM    30
HMI_Pipe_Solenoid_On_SP        16/06/2017 6:55:08 PM    40
HMI_Pipe_Master_Mode           16/06/2017 6:55:08 PM    -1
Pipe_Pump_Run_Status           16/06/2017 6:55:08 PM    -1
Pipe_Read_Pipeline_Pressure    16/06/2017 6:55:08 PM    17.63478
Pipe_Read_Pump_Mode            16/06/2017 6:55:08 PM    0
Pipe_Read_Pump_Run_Cmd         16/06/2017 6:55:08 PM    -1
Pipe_Read_Solenoid_Mode        16/06/2017 6:55:08 PM    -1

The water level values also have high variance and are stored as floating point values. When the pump fills the upper primary tank, the water level in the primary tank rises from a low level to a high level. When the water is drained from the upper primary tank, the water level falls from a high level to a low level. If the high level threshold value of the water tank is changed to a value above the capacity of the upper primary tank, the high threshold will never be met. As a result, the pump will never stop, and the upper primary tank overflows and floods the control system. Likewise, if the low level threshold value is changed to a very low value, the low threshold will not be reached. As a result, the pump will not start to fill the upper primary tank even though the water level has reached the low level capacity of the tank. As in the pressure control system, some features, such as Tank_Stopped shown in Table 4.4, hold a binary value, that is, 0 or −1. If Tank_Stopped holds the value 0, it indicates that the water tank has been stopped. If Tank_Stopped holds the value −1, it indicates that the water tank is in the running state.

Table 4.4: A partial view of the water tank control logs.

VarName                     TimeString               VarValue
Tank_Level                  16/06/2017 6:55:08 PM    61.77073
Tank_Off_SP_Int             16/06/2017 6:55:08 PM    80
Tank_On_SP_Int              16/06/2017 6:55:08 PM    50
Tank_Read_Pump_In_Auto      16/06/2017 6:55:08 PM    0
Tank_Read_Pump_In_Manual    16/06/2017 6:55:08 PM    0
Tank_Read_Pump_Running      16/06/2017 6:55:08 PM    0
Tank_Read_Tank_Level        16/06/2017 6:55:08 PM    61.77073
Tank_Stopped                16/06/2017 6:55:08 PM    -1
Tank_Usage_Level            16/06/2017 6:55:08 PM    38.22927
HMI_Tank_Master_Mode        16/06/2017 6:55:08 PM    -1


Different methods can be used to generate the control system logs, according to the guidelines provided by the vendor of the SCADA control system devices. In the first phase, which covers data generation, recording and storing, the available logging methods are circular, segmented circular, display system event at, and trigger event. In the circular logging method, event activities are recorded in a fixed-size log file. Once the log file is full, logging starts to overwrite the existing log file. The segmented circular logging method is a slight variation of circular logging: instead of overwriting the existing log file when the size limit is reached, it creates a new log file, and when the newly created log file is full, it starts writing to the previous log file again. For the experiment, we used the circular logging method to generate the SCADA control logs.

In the second phase, data collection, three methods (cyclic, on-change and on-demand) can be used to collect the process activity values stored in the PLC devices of the SCADA systems. In the cyclic method, the values stored in the PLC memories are polled simultaneously at a fixed time interval. For example, every 1 second the values stored in the different variables are gathered together. In the on-change method, only the values that have changed during the polling time are polled. The final method of data acquisition is on-demand, where data are polled only when there is a demand or request for the logs. This can be done in a non-periodic manner by running a script in reply to a data request. The previous two methods, in contrast, are periodic, meaning that data are polled at every time interval. Note that, because some values such as pressure and water level have high variance, the on-change method does not record every value that changes; instead, the values are recorded at some time interval. The on-change acquisition mode was chosen because it records the episodic events together in the logs, which allows the log sequences to be segmented by a time-span gap.
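The difference between cyclic and on-change collection can be illustrated with a short Python sketch. It treats a cyclic poll as a list of (timestamp, variable, value) samples and keeps only the samples whose value differs from the last recorded value of the same variable. The sample data and the function name `on_change` are assumptions, and the sketch ignores the interval-based recording of high-variance values mentioned above.

```python
def on_change(samples):
    """Keep only samples whose value differs from the variable's last recorded value."""
    last_seen = {}
    recorded = []
    for timestamp, name, value in samples:
        if name not in last_seen or last_seen[name] != value:
            recorded.append((timestamp, name, value))
            last_seen[name] = value
    return recorded


# Assumed cyclic samples of one binary feature, polled every second.
cyclic_samples = [
    (1, "Conv_Run_Status", 0),
    (2, "Conv_Run_Status", 0),
    (3, "Conv_Run_Status", -1),
    (4, "Conv_Run_Status", -1),
    (5, "Conv_Run_Status", 0),
]

print(on_change(cyclic_samples))
# [(1, 'Conv_Run_Status', 0), (3, 'Conv_Run_Status', -1), (5, 'Conv_Run_Status', 0)]
```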

The logs for the experiment of this chapter were generated and collected over 8 hours of operation of the SCADA control systems. During this period, some attacks were conducted to disrupt the process activities of the control systems. The conveyor belt, pressure system and water tank datasets comprise 1 929, 47 586 and 18 679 lines of logs, respectively. In all three control systems, different types of attacks were conducted during the data generation. In the pressure control system, attacks were conducted by changing the pressure threshold values. For example, the lower pressure threshold value was changed from the set value of 20 PSI (pounds per square inch) to 25 PSI, and the upper threshold value was changed from the set value of 40 PSI to 45 PSI. In the conveyor belt control system, attacks were conducted by unexpectedly changing the direction of the diverter paddle. In addition, on a few occasions the conveyor belt control system was unexpectedly stopped and then started. Finally, in the water tank control system, some attacks were conducted by unexpectedly changing the mode of operation from automatic to manual and from manual to automatic. On a few occasions, the water tank was also stopped and started unexpectedly. Furthermore, some flooding attacks were conducted by changing the values of some of the events multiple times.

4.5.2 Pre-processing

The raw data collected from the data source are not always ready to use in an experiment. The data need pre-processing to prepare a suitable dataset for the data mining algorithm. Pyle [128] presents the basic requirements for data pre-processing, which include data cleaning, normalization, transformation, feature extraction and feature selection. After the pre-processing steps are completed, a set of data is ready for the actual data mining process. The control logs collected for the experiment are in comma-separated values (CSV) format, which is not directly suitable for our proposed constraint-based rare sequential pattern mining algorithm.

To transform the CSV-formatted SCADA control system logs into a sequential database SDB, we first select the required features and then merge each feature with its corresponding value. A feature merged with its value represents an individual event. We formulated feature selection rules to select the important features and to drop the insignificant ones while identifying anomalies. The features that carry the significant information needed to identify anomalous patterns are selected for the experiment, that is, the features whose values can be changed to conduct attacks on the control system. For example, an attack can be conducted by changing the value of the feature Tank_Read_Pump_In_Manual to operate the water tank control system in manual mode rather than in automatic mode. In addition, we do not select features that duplicate the value of other features. For example, the features Pump_Off_SP_Int and HMI_Pipe_Pump_Off_SP of the pressure control system hold the same value, so we select only one of them for the experiment. We also drop the features that hold metadata, that is, features whose value carries information about other features. For example, the variable Time_ms holds a time value in milliseconds, while the variable TimeString holds the same value in date and time format. The benefit of selecting the significant features and dropping the unimportant ones is that the sequences in the database stay small. The reduced number of features generates fewer candidate sequences, and fewer candidate sequences require fewer comparisons to find the rare patterns, which consumes less computational time. We pre-processed the control logs of each system (the conveyor belt, the pressure control and the water tank control system) into its own sequence database.

(i) Conveyor belt control logs: In the conveyor belt control system, the process activities were recorded under 9 different features. Each feature holds a binary value, 0 or −1. Every feature is merged with each of its values to represent an individual event. For example, the feature Conv_Run_Status contains either 0 or −1 depending on the current status of the conveyor belt control system: if the conveyor belt is running, the feature holds the value −1; otherwise it holds the value 0. Therefore, this feature generates two events, {Conv_Run_Status_0} and {Conv_Run_Status_-1}. The log events are then segmented into the sequences that comprise the sequential database SDB. The sequences are segmented based on the average time-span gap between two consecutive episodic events in the control logs, which ensures that a pattern can be generated from a time-span constrained episodic sequence. Without this segmentation, a pattern could be fragmented across consecutive sequences. Applying the time-span constraint generates 171 sequences, which make up the conveyor belt database.

Among the sequences in the database, the longest comprises 12 events. There are 38 unique events in the database. Among these events, 17 are generated from the 9 features: since one of the features, emergency_stop, holds only a single value instead of a binary value, it generates a single event rather than two, so the conveyor belt features generate 17 events instead of 18. The remaining 11 events out of 38 unique events are generated by combining the unique events that occur simultaneously in the database. For example, {Conv_Run_Status_0, Conv_Read_Conv_Color_PE_0} is a unique event that comprises the two events {Conv_Run_Status_0} and {Conv_Read_Conv_Color_PE_0}, because these two events occurred simultaneously on the control system.

(ii) Pressure system control logs: In the pressure control system, the events were recorded under 17 features, unlike the conveyor belt control system, where events were recorded under 9 features. Among these 17 features, some hold binary values, some hold integer values, and the rest hold floating point values. For example, the feature Pipe_Solenoid_Open_Status holds the binary value 0 or −1 depending on the current status of the solenoid of the pressure control system. Since each feature is merged with its corresponding values to generate events, the feature Pipe_Solenoid_Open_Status generates two individual events, {Pipe_Solenoid_Open_Status_0} and {Pipe_Solenoid_Open_Status_-1}. Another feature, Pipe_Read_Pipeline_Pressure, holds floating point values such as 20.18696, which indicate the current pressure on the pipeline of the pressure control system. The pipeline pressure moves from the lower threshold value to the upper threshold value and vice versa. Since the pressure values are floating point numbers, Pipe_Read_Pipeline_Pressure is a high-variance feature, which increases the number of unique events in the pressure control database.

To reduce the number of unique events in the database, we converted the values of high-variance features to their ceiling values. For example, the feature Pipe_Read_Pipeline_Pressure records the current pressure as a floating point value as the pressure on the pressure control system changes. Because each feature is merged with its corresponding values, fewer distinct feature values produce fewer unique events. Therefore, the pressure value 20.18696 is rounded up to 21, which is merged with the feature Pipe_Read_Pipeline_Pressure to create the individual event Pipe_Read_Pipeline_Pressure_21. In this way, rounding up the values reduces the number of unique events.

After generating the individual events, as for the conveyor belt control database, we create sequences from the pressure control logs by segmenting them with the average time-span gap between two consecutive episodic events. This segmentation generates 232 sequences, which make up the pressure control database. Among these sequences, the longest comprises 17 events. In addition, a total of 72 unique events are generated in the pressure control database. Among the 72 unique events, 52 are generated by merging 16 features with their corresponding values. Since the pressure control system features hold not only binary values but also integer and floating point values, the number of unique events is larger than in the conveyor belt control system. The remaining 20 events are combinations of the 52 unique events that occur simultaneously in the pressure control database.

(iii) Water tank control logs: In the water tank control system, the process activities were recorded under 10 different features. As in the pressure control system, some of these features hold binary values and others hold integer or floating point values. The floating point values of the high-variance features are converted to ceiling values, as for the pressure control system features, which reduces the number of unique events. Like the previous two control systems, the water tank database is created once the individual log events are generated. There are 323 sequences generated from the water tank control logs. The database contains 92 unique events. The longest sequence is composed of 22 events.
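The pre-processing steps described for the three control logs (merging a feature with its value, taking the ceiling of floating point readings, grouping simultaneous events, and cutting sequences at large time gaps) can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used in the thesis: the function names `to_event` and `segment`, the parameter `max_gap`, and the sample log rows are hypothetical, and the exact way the average time-span gap is turned into a cut-off is an assumption.

```python
import math


def to_event(feature, value):
    """Merge a feature name with one of its values into an event label."""
    if isinstance(value, float):
        # High-variance floating point readings are rounded up (ceiling).
        value = math.ceil(value)
    return f"{feature}_{value}"


def segment(rows, max_gap):
    """Split time-ordered (time_ms, event) rows into sequences of itemsets.

    A new sequence starts when the gap to the previous row exceeds max_gap
    (assumed here to play the role of the average time-span gap).
    Events that share a timestamp are placed in the same itemset.
    """
    sequences, current, prev_time = [], [], None
    for time_ms, event in rows:
        if prev_time is not None and time_ms - prev_time > max_gap:
            sequences.append(current)
            current = []
        if current and time_ms == prev_time:
            current[-1].add(event)      # simultaneous event
        else:
            current.append({event})     # next event in the sequence
        prev_time = time_ms
    if current:
        sequences.append(current)
    return sequences


# Hypothetical conveyor belt log rows: (timestamp in ms, feature, value).
raw = [
    (0, "Conv_Run_Status", 0),
    (0, "Conv_Read_Conv_Color_PE", 0),
    (400, "Conv_Run_Status", -1),
    (9000, "Conv_Run_Status", 0),
]
rows = [(t, to_event(f, v)) for t, f, v in raw]
sdb = segment(rows, max_gap=1000)
for sequence in sdb:
    print(sequence)

# The pressure reading from the text is rounded up before merging.
pressure_event = to_event("Pipe_Read_Pipeline_Pressure", 20.18696)
print(pressure_event)
```

In this sketch the first two rows share a timestamp and therefore form one combined event, while the long gap before the last row starts a second sequence.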

4.5.3 Experimental methodology

The datasets prepared during the pre-processing phase are used with the proposed constraint-based rare sequential pattern mining algorithm described in Section 4.4. The algorithm comprises two phases. In the first phase, Algorithm 4.1 generates constrained rare sequential generators. In the second phase, Algorithm 4.2 generates constrained rare sequential patterns, which are extended from the generators. To find the impact of using constraints with the rare sequential pattern mining algorithm discussed in Chapter 3, we conducted experiments with and without constraints on the same datasets. Finally, we evaluated the performance of these experiments using the precision and the recall derived from the confusion matrix shown in Table 4.5.

Precision or Detection Rate (DR): It defines the ratio between the number of correctly detected malicious events or attacks and the total number of predicted attacks or malicious events. The precision is defined as follows:

$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \times 100\% \tag{4.1}$$

Here, True Positive indicates the correct identification of an intrusive case by the algorithm. False Positive represents the incorrect identification of a benign case as an intrusive case, whereas False Negative represents the incorrect identification of an intrusive case as a benign case. For example, assume that an anomaly detection algorithm has classified 148 cases, of which only 20 are actual anomalies or attacks on the system, as shown in Table 4.5. The algorithm has identified 40 predictions as intrusive, but only 12 of them are correct, which is the True Positive count. The other 28 predictions are incorrect, which is the False Positive count. In addition, the algorithm identified 108 predictions as benign, or not intrusive. Among these, 8 predictions are incorrect, which is the False Negative count, and the remaining 100 predictions are correct, which is the True Negative count.

Table 4.5: Confusion matrix.

| | True condition: Condition Positive | True condition: Condition Negative |
|---|---|---|
| Predicted condition: Condition Positive | TP: 12 (True Positive) | FP: 28 (False Positive), Type I error |
| Predicted condition: Condition Negative | FN: 8 (False Negative), Type II error | TN: 100 (True Negative) |


Recall or True Positive Rate (TPR): It defines the ratio between the number of correctly detected malicious events or attacks and the total number of actual malicious events or attacks [129]. The recall is defined as follows:

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \times 100\% \tag{4.2}$$

Therefore, using the confusion matrix in Table 4.5, the precision and the recall of the anomaly detection algorithm are 30% and 60%, respectively.
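The worked example can be checked with a few lines of Python. This is a minimal sketch that applies equations (4.1) and (4.2) to the counts of Table 4.5; the function name `precision_recall` is an illustrative choice and not part of the thesis.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) in percent, following (4.1) and (4.2)."""
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    return precision, recall


# Counts taken from the confusion matrix in Table 4.5.
precision, recall = precision_recall(tp=12, fp=28, fn=8)
print(f"precision = {precision:.0f}%, recall = {recall:.0f}%")
```

The True Negative count does not enter either measure, which is why only three of the four cells are passed to the function.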

We designed four experiments (Experiment-1 to Experiment-4) to measure the effectiveness and efficiency of our proposed constraint-based rare sequential pattern mining algorithm in detecting anomalies in the SCADA control system logs. We conducted these four experiments individually on the three control system datasets.

(i) First Experiment: The first experiment (Experiment-1) runs our proposed rare sequential pattern mining algorithm without any additional constraints apart from the time-span constrained database. The aim of the first experiment is to record the computational time of the rare sequential pattern mining algorithm and the total number of rare sequential patterns it generates. These figures are then compared with the computational time and the number of rare patterns generated by the other three experiments, where the constraints are implemented. In this experiment, the maximum support threshold was set to 2 because it was assumed that anomalies occur rarely in a system. The low threshold ensures that we can filter out the frequently occurring sequences from the control system databases.

(ii) Second Experiment: In the second experiment (Experiment-2), we added the feature reduction constraint along with the time-span constraint to reduce the number of unique events in the database. The goal of the second experiment is to find the computational time and the number of rare sequential patterns when the feature reduction constraint is used on its own. To find anomalies effectively and efficiently, we applied the feature reduction constraints in addition to the time-span constraint to the three control system databases. As a result, anomalies are detected from a reduced number of rare sequential patterns, which consumes less computational time than Experiment-1.

To obtain fewer rare patterns and less computational time, we reduced the number of features in the database by removing the less significant features that do not contribute to finding anomalies. This means that the features that are less likely to change the process outcome are not selected in the rare sequential pattern mining process. Likewise, the features whose values are unlikely to be altered while conducting attacks on the system are not selected. We therefore selected only those features whose values can be changed to conduct attacks on the control system. For example, changing the value of the conveyor belt feature HMI_Conv_Direction from 0 to −1 can alter the direction of the sorted objects on the conveyor belt.

Firstly, for the conveyor belt control system experiment, we selected 3 of the 9 features from the control logs. These three features with their corresponding values resulted in 6 individual events. As a result, the number of unique events was reduced from 38 to 8 in the conveyor belt database. With the reduced number of unique events, 72 sequences are generated from the conveyor belt database instead of the 171 sequences used in Experiment-1, where no feature reduction constraint was applied. In addition, the longest sequence is reduced to 3 events, compared to 12 events in Experiment-1.

Secondly, for the pressure control system experiment, we selected 3 features out of 16 features. These 3 features were selected because their values can be altered to change the process outcome of the pressure control system. As a result, the number of unique events was reduced to 7 from the 72 events used in Experiment-1. The number of sequences was also reduced to 48 from the 232 sequences used in Experiment-1. Further, the longest sequence is reduced to 3 events from 17 events.

Finally, for the water tank control system experiment, we selected 4 of the 10 features. As a result, the number of unique events was reduced to 8 from the 92 events used in Experiment-1. The number of sequences was also reduced to 81 from 323. Further, the longest sequence was reduced to 7 events from the 22 events in Experiment-1. After implementing the feature reduction constraints in the three control system databases, we ran our proposed rare sequential pattern mining algorithm with a maximum support threshold of 2, as in Experiment-1.

(iii) Third Experiment: In the third experiment (Experiment-3), we applied two algorithmic constraints along with the time-span constrained database to avoid unnecessary scanning of the database. We used the algorithmic constraints without implementing the feature reduction constraints on the database. The goal of the third experiment is to find the computational time and the number of rare sequential patterns when the algorithmic constraints are used on their own. In other words, we examine whether the algorithm can find the same number of anomalous patterns as Experiment-1 and Experiment-2, and we compare its computational time with that of those two experiments.

(iv) Fourth Experiment: Finally, in the fourth experiment (Experiment-4), we combined all the constraints, namely the feature reduction constraints and the algorithmic constraints, along with the time-span constrained database. The goal of the fourth experiment is to evaluate the performance of the algorithm when the constraints are used together rather than independently.
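The role of the maximum support threshold used in all four experiments can be illustrated with a short Python sketch. It is not Algorithm 4.1 or Algorithm 4.2 and does not model generators, the time-span constraint or the algorithmic constraints; it only counts in how many sequences a candidate pattern occurs and keeps the pattern if that count does not exceed the threshold. Treating the threshold as inclusive (support of at most 2) is an assumption, and the function names and the toy database are hypothetical.

```python
def contains(sequence, pattern):
    """True if pattern (a list of itemsets) occurs in order within sequence."""
    pos = 0
    for itemset in pattern:
        # Advance to the next itemset of the sequence that covers this one.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(sdb, pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for sequence in sdb if contains(sequence, pattern))


def is_rare(sdb, pattern, max_support=2):
    """A pattern is kept as rare if it occurs, but at most max_support times."""
    return 0 < support(sdb, pattern) <= max_support


# Toy database: the belt is started three times and stopped mid-run once.
sdb = [
    [{"Conv_Run_Status_-1"}, {"Conv_Run_Status_0"}],
    [{"Conv_Run_Status_0"}, {"Conv_Run_Status_-1"}],
    [{"Conv_Run_Status_0"}, {"Conv_Run_Status_-1"}],
    [{"Conv_Run_Status_0"}, {"Conv_Run_Status_-1"}],
]
stop = [{"Conv_Run_Status_-1"}, {"Conv_Run_Status_0"}]
start = [{"Conv_Run_Status_0"}, {"Conv_Run_Status_-1"}]
print(support(sdb, stop), is_rare(sdb, stop))
print(support(sdb, start), is_rare(sdb, start))
```

A low threshold therefore discards the frequently repeated start-up sequence and retains the unusual stop sequence for inspection.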

4.6 Results and Analysis

This section presents the experimental results and analysis. We conducted the constraint-based rare sequential pattern mining on three control system databases. First, Section 4.6.1 presents the results obtained from the conveyor belt database. Secondly, Section 4.6.2 presents the results from the pressure control database. Finally, Section 4.6.3 presents the results from the water tank control system database.

4.6.1 Conveyor-belt Control System

The rare sequential pattern mining algorithm in the first experiment (Experiment-1), which does not include any additional constraints other than the time-span constrained database, generated 906 925 rare sequential patterns. The algorithm took 4 days, 7 hours and 26 minutes to generate these rare sequential patterns. A partial view of the rare sequential patterns is shown in Table 4.6. Among these rare sequential patterns, 4 were detected as anomalies and attack patterns. The remaining rare sequential patterns are merely suspicious patterns that did not reveal any other anomalies, so they are less important for detecting anomalies. The algorithm required a large computational time to generate these rare sequential patterns, although most of them do not contribute to detecting anomalies. Moreover, it is difficult to detect anomalous patterns among such a large number of rare sequential patterns, because the security operators need to check the rare patterns manually to identify the anomalies. Therefore, Experiment-1 shows that the rare sequential pattern mining algorithm without any additional constraints other than the time-span constrained database is less effective and efficient in detecting anomalies.

Table 4.6: A partial view of the conveyor-belt result from the Experiment-1.

SID1 〈{Conv_Read_Conv_HMI_Direction_0}〉
SID2 〈{HMI_Conv_Direction_-1}〉
SID3 〈{Conv_Read_Conv_HMI_Direction_-1}〉
SID4 〈{Conv_Read_Conv_HMI_Direction_-1}, {HMI_Conv_Direction_-1}〉
SID5 〈{Conv_Read_Solenoid_Left_Direction_-1}, {Conv_Read_Solenoid_Right_Direction_0}, {Conv_Read_Conv_Color_PE_0}, {Conv_Read_Conv_Present_PE_0}, {Conv_Run_Status_0}, {Conv_Run_Status_-1}, {Conv_Run_Status_0}, {Conv_Run_Status_-1}, {Conv_Read_Solenoid_Left_Direction_0}, {Conv_Read_Solenoid_Right_Direction_-1}, {Conv_Read_Conv_Color_PE_-1}, {Conv_Read_Conv_HMI_Direction}〉

In Experiment-2, where we added the feature reduction constraints in addition to the time-span constrained database, the rare sequential pattern mining algorithm generated only 16 rare sequential patterns, a significant reduction from the 906 925 rare sequential patterns generated by Experiment-1. Moreover, the computational time for Experiment-2 is less than a minute, compared to 4 days, 7 hours and 26 minutes for Experiment-1. Although Experiment-2 generated only 16 rare sequential patterns, it detects the same number of anomalous patterns, 4, as Experiment-1. This means that Experiment-2 did not miss any anomalies that were detected by Experiment-1. However, Experiment-1 detected the anomalous patterns from a large number of rare sequential patterns, which took far more computational time than Experiment-2. Since finding anomalies among fewer rare sequential patterns requires less work from the security operators, the feature reduction constrained Experiment-2 is more effective in finding anomalies than Experiment-1. In addition, as Experiment-2 takes less computational time than Experiment-1, it is also more efficient in detecting anomalies. A partial view of the results from Experiment-2 is shown in Table 4.7.

Table 4.7: A partial view of the conveyor-belt result from Experiment-2.

SID1 〈{Conv_Read_Conv_HMI_Direction_0}〉
SID2 〈{HMI_Conv_Direction_-1}〉
SID3 〈{HMI_Conv_Direction_0}〉
SID4 〈{Conv_Read_Conv_HMI_Direction_-1}〉
SID5 〈{Conv_Read_Conv_HMI_Direction_-1}, {HMI_Conv_Direction_-1}〉
SID6 〈{Conv_Read_Conv_HMI_Direction_0}, {HMI_Conv_Direction_0}〉
SID7 〈{Conv_Run_Status_-1}, {Conv_Run_Status_-1}〉
SID8 〈{Conv_Run_Status_-1}, {Conv_Run_Status_0}〉
SID9 〈{Conv_Run_Status_0}, {Conv_Run_Status_-1}〉
SID10 〈{Conv_Run_Status_-1}, {HMI_Conv_Direction_-1}〉
SID11 〈{Conv_Run_Status_-1}, {Conv_Read_Conv_HMI_Direction_-1}〉
SID12 〈{Conv_Run_Status_-1}, {Conv_Read_Conv_HMI_Direction_-1}, {HMI_Conv_Direction_-1}〉

In the third experiment (Experiment-3), which used the algorithmic constraints without applying the feature reduction constraints on the database, the algorithm generated 906 925 rare sequential patterns, the same number as Experiment-1. This is because neither Experiment-1 nor Experiment-3 implemented the feature reduction constraint. As a result, the number of unique events, the size of the sequences, and the size of the database remain unchanged. However, the computational time taken by Experiment-3 is 3 days, 2 hours and 54 minutes, which is less than the 4 days, 7 hours and 26 minutes taken by Experiment-1, as shown in Table 4.8. This is due to the algorithmic constraints, which helped Experiment-3 reduce the computational time.
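The saving attributable to the algorithmic constraints on the conveyor-belt database can be quantified directly from the two reported execution times. The following Python sketch performs that arithmetic; the variable names are illustrative, and the relative reduction is a derived figure, not one reported in the thesis.

```python
from datetime import timedelta

# Execution times reported for the conveyor-belt database.
exp1 = timedelta(days=4, hours=7, minutes=26)   # no additional constraint
exp3 = timedelta(days=3, hours=2, minutes=54)   # algorithmic constraints

saved = exp1 - exp3
reduction = 100.0 * (saved / exp1)  # timedelta / timedelta yields a float

print(f"time saved: {saved}")
print(f"relative reduction: {reduction:.1f}%")
```

Under these figures the algorithmic constraints save 1 day, 4 hours and 32 minutes, which is about 27.6% of the unconstrained run time.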

The fourth experiment (Experiment-4), which used the feature reduction constraint and the algorithmic constraint together along with the time-span constrained database, generated 16 rare sequential patterns. Among these 16 rare sequential patterns, 4 were detected as anomalous patterns when compared to the labelled attack dataset. The number of rare sequential patterns generated by Experiment-4 is significantly lower than the number generated by Experiment-1 and Experiment-3. The reason is that Experiment-4 used the feature reduction constraint, which Experiment-1 and Experiment-3 did not use. The feature reduction constraints in Experiment-4 reduced the unique events, the size of the sequences, and the size of the database, which in turn reduced the number of rare sequential patterns generated compared to Experiment-1 and Experiment-3. Generating fewer rare sequential patterns required less computational time. In addition, the algorithmic constraints also helped reduce the computational time while generating the rare sequential patterns, as shown in Table 4.8.

Table 4.8: A comparison table showing the number of rare sequential patterns and the computational time taken by the four experiments on the conveyor-belt database.

| Experiment# | # Rare patterns | Execution time |
|---|---|---|
| Experiment-1, No Constraint, Conveyor belt SDB | 906 925 | 4 days 7 hours 26 minutes |
| Experiment-2, Feature Constraint, Conveyor belt SDB | 16 | < 1 minute |
| Experiment-3, Algorithmic Constraint, Conveyor belt SDB | 906 925 | 3 days 2 hours 54 minutes |
| Experiment-4, Combined Constraint, Conveyor belt SDB | 16 | < 1 minute |


Experiment-2 and Experiment-4 generated the same 16 rare sequential patterns because both experiments applied the feature reduction constraints. Unlike Experiment-2, Experiment-4 also uses the algorithmic constraints, which reduced the computational time to a matter of seconds. Since the feature reduction constraint had already reduced the unique events, the size of the sequences and the size of the database, the reduction in computational time between Experiment-2 and Experiment-4 is minimal.

The four experiments show that implementing the constraints improved the effectiveness and efficiency of the proposed rare sequential pattern mining algorithm. The constraint-based rare sequential pattern mining is effective because the constrained algorithm generated fewer rare sequential patterns, and it is easier to detect the anomalous patterns among a small number of rare sequential patterns than among a large number. In addition, the constraint-based rare sequential pattern mining algorithm is efficient because generating the rare sequential patterns consumes less computational time than the unconstrained rare sequential pattern mining algorithm.

4.6.1.1 Performance Evaluation

We also evaluated the performance of our proposed constraint-based rare sequential pattern mining algorithm using the precision and the recall derived from the confusion matrix, as illustrated in Table 4.5. There were 5 attacks conducted on the conveyor belt control system during the log generation phase. These attacks were of 3 types:

(a) Unscheduled stoppage of the conveyor belt.

(b) Unscheduled start of the conveyor belt.

(c) Unwanted changes to the direction of the diverter gate of the conveyor belt.

In the logs, the attacked events were identified and labelled so that the anomalies can be detected by verifying the rare sequential patterns against the labelled dataset. The proposed constraint-based rare sequential pattern mining algorithm successfully detected 4 anomalous patterns by verifying the rare sequential patterns against the attacked dataset, which contained 5 actual attacks on the conveyor belt control system. For example, the rare sequential pattern 〈{Conv_Run_Status_-1}, {Conv_Run_Status_0}〉 shown in row SID8 of Table 4.7 was detected as an attack pattern. This is because, during the attack, the conveyor belt control system was deliberately stopped while it was running. The first event {Conv_Run_Status_-1} of this attack pattern indicates that the conveyor belt was running, while the second event {Conv_Run_Status_0} indicates that the conveyor belt was stopped. Another example of an anomalous and attack pattern detected in the conveyor belt database is 〈{Conv_Run_Status_-1}, {Conv_Read_Conv_HMI_Direction_-1}, {HMI_Conv_Direction_-1}〉 shown in row SID12 of Table 4.7, which indicates that the direction of the conveyor belt's diverter gate was changed while the attack was conducted on the control system. That is, the values of the second and third events of this attack pattern, {Conv_Read_Conv_HMI_Direction_-1} and {HMI_Conv_Direction_-1} respectively, were changed from 0 to −1.

Experiment-2 and Experiment-4 show that the constraint-based rare sequential pattern mining algorithm correctly detected 4 attacks from the 16 generated rare sequential patterns. The remaining 12 rare sequential patterns were not identified as anomalous patterns. The True Positive and the False Positive counts of the constraint-based rare sequential pattern mining algorithm are therefore 4 and 12, respectively, and the precision of the algorithm for Experiment-2 and Experiment-4, calculated using equation (4.1), is 25%. This means that the anomaly detection rate of the algorithm is 25%. Since the algorithm detected 4 attack patterns out of 5 actual attacks, the True Positive count is 4 and the False Negative count is 1. Therefore, the recall of the constrained rare sequential pattern mining algorithm, calculated using equation (4.2), is 80%, which means that the sensitivity or true positive rate of the algorithm is 80%.
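Written out with the counts stated above (4 true positives, 12 false positives and 1 false negative), the two calculations from equations (4.1) and (4.2) are:

```latex
\begin{align*}
\text{Precision} &= \frac{4}{4 + 12} \times 100\% = 25\%, \\
\text{Recall}    &= \frac{4}{4 + 1} \times 100\% = 80\%.
\end{align*}
```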

4.6.2 Pressure Control System

The rare sequential pattern mining algorithm in Experiment-1, which did not include any additional constraints, generated 1 107 540 rare sequential patterns. Among these rare sequential patterns, 5 were detected as anomalous patterns by verifying them against the labelled attack dataset. A partial view of the rare patterns is shown in Table 4.9. The remaining rare sequential patterns are suspicious patterns, which are less important for detecting anomalies. Experiment-1 took 5 days, 11 hours and 40 minutes to generate these rare sequential patterns.

Table 4.9: A partial view of the pressure control result from Experiment-1.

SID1 〈{Solenoid_On_SP_Int_40}〉
SID2 〈{Solenoid_Off_SP_Int_25}〉
SID3 〈{Solenoid_On_SP_Int_45}〉
SID4 〈{Solenoid_On_SP_Int_45}, {Solenoid_On_SP_Int_40}〉
SID5 〈{Pipe_Read_Solenoid_Mode_-1}, {Solenoid_On_SP_Int_40}, {Pressure_Int_45}, {Pipe_Read_Solenoid_Open_Cmd_-1}, {Pipe_Read_Pipeline_Pressure_44}, {Pressure_Int_43}, {Pipe_Read_Solenoid_Open_Cmd_0}, {Pipe_Read_Pipeline_Pressure_41}, {Pipe_Read_Solenoid_Open_Cmd_-1}〉

Experiment-1 detected 5 anomalous patterns, but it generated a large number of rare sequential patterns, which required a large computational time. It is difficult and time-consuming to identify the anomalous patterns among such a large number of rare sequential patterns. Therefore, to detect anomalies effectively and efficiently, that is, from a small number of rare sequential patterns and with less computational time, we added the feature reduction constraints in the second experiment (Experiment-2). Experiment-2 generated only 9 rare sequential patterns, a significant reduction from the 1 107 540 rare sequential patterns generated by Experiment-1. Even though Experiment-2 generated only 9 rare sequential patterns, the number of detected anomalous patterns, which is 5, is the same as in Experiment-1. This means that Experiment-2 did not miss any of the anomalous patterns that were detected by Experiment-1.

The feature reduction constrained Experiment-2 is more effective than Experiment-1, which had no feature reduction constraint, because Experiment-2 generated fewer rare sequential patterns. It is easier to find anomalies among a small number of rare sequential patterns than among a large number. In addition, the computational time taken by Experiment-2 is less than a minute, a significant reduction from the 5 days, 11 hours and 40 minutes taken by Experiment-1. Hence, Experiment-2 is more efficient for anomaly detection than Experiment-1. A partial view of the rare sequential patterns generated by Experiment-2 is shown in Table 4.10.


Table 4.10: A partial view of the pressure control result from Experiment-2.

SID1 〈{Solenoid_On_SP_Int_40}〉
SID2 〈{Solenoid_Off_SP_Int_25}〉
SID3 〈{Solenoid_On_SP_Int_45}〉
SID4 〈{Solenoid_Off_SP_Int_30}〉
SID5 〈{Solenoid_Off_SP_Int_30}, {Solenoid_On_SP_Int_40}〉
SID6 〈{Solenoid_On_SP_Int_45}, {Solenoid_On_SP_Int_40}〉
SID7 〈{Pipe_Read_Solenoid_Mode_-1}, {Solenoid_On_SP_Int_40}〉
SID8 〈{Pipe_Read_Solenoid_Mode_-1}, {Solenoid_On_SP_Int_45}〉
SID9 〈{Pipe_Read_Solenoid_Mode_-1}, {Solenoid_On_SP_Int_45}, {Solenoid_On_SP_Int_40}〉

In the third experiment (Experiment-3), the algorithmic constrained rare sequential pattern mining algorithm generated 1 107 540 rare sequential patterns, the same number as Experiment-1. Although Experiment-1 and Experiment-3 generated exactly the same number of rare sequential patterns, Experiment-3 took less computational time, 4 days, 2 hours and 14 minutes, compared to the 5 days, 11 hours and 40 minutes taken by Experiment-1. The algorithmic constraints helped reduce the computational time for Experiment-3. Experiment-3 also detected exactly the same number of anomalous patterns as Experiment-1 and Experiment-2, which is 5. This means that Experiment-3 did not miss any attack patterns that were detected by the previous two experiments. The computational time for Experiment-2, which is less than a minute, is significantly lower than that of Experiment-1 and Experiment-3. This is because Experiment-2 applied the feature reduction constraint, which reduced the number of unique events, the size of the sequences in the database and the size of the database. Larger values of these three factors increase both the number of rare sequential patterns and the computational time, as shown in Table 4.11. Hence, Experiment-3 is more efficient than Experiment-1, but less efficient than Experiment-2.
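The size of the saving from the algorithmic constraints on the pressure control database follows from the two reported execution times. In the worked calculation below, the symbols $t_1$ and $t_3$ (the execution times of Experiment-1 and Experiment-3 in minutes) are notation introduced here for illustration, and the percentage is a derived figure rather than one reported in the thesis.

```latex
\begin{align*}
t_{1} &= 5 \times 1440 + 11 \times 60 + 40 = 7900 \text{ min}, \\
t_{3} &= 4 \times 1440 + 2 \times 60 + 14 = 5894 \text{ min}, \\
\frac{t_{1} - t_{3}}{t_{1}} \times 100\% &= \frac{2006}{7900} \times 100\% \approx 25.4\%.
\end{align*}
```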

Finally, the fourth experiment (Experiment-4) produced 9 rare sequential patterns, equal to the number generated by Experiment-2. This is because both Experiment-2 and Experiment-4 used the feature reduction constraint, which reduced the number of rare sequential patterns. Although Experiment-4 additionally used the algorithmic constraints, they did not further reduce the number of rare sequential patterns, because the algorithmic constraints reduce the computational time rather than the number of patterns. The computational time of both experiments (Experiment-2 and Experiment-4) is less than a minute, although Experiment-4 took less computational time, in the order of seconds, than Experiment-2. The reduction in computational time between Experiment-2 and Experiment-4 due to the algorithmic constraints is minimal because the feature reduction constraint had already made the database small. If the database size had not been reduced, the algorithmic constraints could reduce the computational time significantly, as shown in Table 4.11 by the difference between Experiment-1 and Experiment-3.

Table 4.11: A comparison of the number of rare patterns and the time taken by all 4 experiments on the pressure control SDB.

Experiment (constraint, database)                              # Rare patterns    Execution time
Experiment-1 (No Constraint, Pressure control SDB)             1 107 540          5 days, 11 hours, 40 minutes
Experiment-2 (Feature Constraint, Pressure control SDB)        9                  < 1 minute
Experiment-3 (Algorithmic Constraint, Pressure control SDB)    1 107 540          4 days, 2 hours, 14 minutes
Experiment-4 (Combined Constraint, Pressure control SDB)       9                  < 1 minute

Experiment-4 also detected 5 anomalous patterns, equal to the number detected by the previous three experiments (Experiment-1, Experiment-2 and Experiment-3). The implementation of constraints has therefore improved the effectiveness and efficiency of the proposed rare sequential pattern mining algorithm. A comparison of the number of rare sequential patterns generated and the execution time taken by the four experiments is given in Table 4.11.


4.6.2.1 Performance Evaluation

We also evaluated the performance of the proposed constraint-based rare sequen-

tial pattern mining algorithm. The performance is measured with the precision

and the recall using the confusion matrix shown in Table 4.5. In the pressure con-

trol system, there were 6 attacks conducted during the generation of the control

logs. These attacks were of 4 types:

(a) Unexpected changes to the upper threshold value of the pressure control

system.

(b) Unexpected changes to the lower threshold value of the pressure control

system.

(c) Unscheduled stopping of the pressure control system.

(d) Unscheduled starting of the pressure control system.

The constrained rare sequential pattern mining algorithm successfully detected 5 anomalous patterns. For example, the pattern 〈{Solenoid_Off_SP_Int_25}〉, shown in row SID2 of Table 4.10, indicates that the lower threshold value of the pressure control system is currently 25 PSI. As a precondition for the pressure control system experiment, the lower threshold value was set to 20 PSI. Since the lower threshold value was deliberately changed from the predefined 20 PSI to 25 PSI during the attack procedure conducted on the pressure control system, the pattern 〈{Solenoid_Off_SP_Int_25}〉 indicates an attack pattern. Another example of a rare sequential pattern that was detected as an attack pattern is 〈{Pipe_Read_Solenoid_Mode_-1}, {Solenoid_On_SP_Int_45}, {Solenoid_On_SP_Int_40}〉, shown in row SID9 of Table 4.10. This pattern arises because the upper threshold value was changed from the predefined 40 PSI to 45 PSI. This unexpected change was also made during the attack procedure conducted on the pressure control system.

Since the constraint-based rare sequential pattern mining algorithm in Experiment-2 and Experiment-4 detected 5 anomalous patterns among the 9 generated rare sequential patterns, the True Positive and False Positive counts of the algorithm are 5 and 4 respectively. Using equation (4.1), the precision of the constraint-based rare sequential pattern mining algorithm is 55% (5 out of 9). On the other hand, since the algorithm detected 5 attack patterns out of 6 actual attacks, the True Positive count is 5 and the False Negative count is 1. So, using equation (4.2), the recall of the algorithm is 83% (5 out of 6).
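As a worked check of these figures, the following minimal Python sketch recomputes the precision and the recall from the reported counts. It assumes that equations (4.1) and (4.2) are the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN); the function names are illustrative and not part of the thesis.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported patterns that are real attacks (assumed form of Eq. 4.1)."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of real attacks that were reported (assumed form of Eq. 4.2)."""
    return tp / (tp + fn)


# Counts reported for the pressure control system (Experiment-2 and Experiment-4).
TP, FP, FN = 5, 4, 1

print(f"precision = {precision(TP, FP):.3f}")
print(f"recall    = {recall(TP, FN):.3f}")
```

The two values are the ratios 5/9 and 5/6 that lie behind the percentages reported for the pressure control system.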

4.6.3 Water Tank Control System

The first experiment (Experiment-1) generated 1 204 514 rare sequential patterns from the water tank control system. To generate these rare sequential patterns, the rare sequential pattern mining algorithm took 6 days, 2 hours and 11 minutes. A partial view of the rare sequential patterns generated by Experiment-1 is given in Table 4.12. Although some attacks were conducted on the water tank control system, no anomalous patterns were detected, unlike in the conveyor belt and pressure control system experiments. The attacks were of two types:

(a) Unexpected change of the water tank mode of operation from the manual

mode to the automatic mode.

(b) Unexpected change of the water tank mode of operation from the automatic

mode to the manual mode.

Table 4.12: A partial view of the water tank result from Experiment-1.

SID1  〈{Tank_Off_SP_Int_80}〉
SID2  〈{Tank_On_SP_Int_50}〉
SID3  〈{Tank_Off_SP_Int_80}, {Tank_On_SP_Int_50}〉
SID4  〈{Solenoid_Off_SP_Int_30}〉
SID5  〈{Tank_Read_Pump_In_Auto_-1}, {Tank_Read_Pump_Running_-1}, {Tank_Level_45}, {Tank_Read_Tank_Level_45}, {Tank_Usage_Level_55}, {Tank_Level_43}, {Tank_Read_Tank_Level_43}, {Tank_Usage_Level_57}〉
SID6  〈{Tank_Read_Pump_In_Auto_-1}, {Tank_Read_Pump_In_Auto_-1}, {Tank_Read_Pump_Running_0}, {Tank_Usage_Level_76}, {Tank_Level_25}, {Tank_Read_Tank_Level_25}, {Tank_Usage_Level_72}, {Tank_Level_29}〉

These two types of attacks were carried out as flooding attacks, which means that the mode of operation was changed many times on the water tank control system. Because of the flooding, the mode-change events became frequent events. Since the proposed rare sequential pattern mining algorithm generates rare sequential patterns comprising rare events, the flooding attacks could not be detected on the water tank control system.
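The effect of flooding on rarity can be illustrated with a small Python sketch. It assumes that an event is rare when its relative support (the fraction of database sequences that contain it) is above zero and below a maximum support threshold. The threshold value, the toy database and the event names `Normal` and `Mode_Change` are hypothetical and chosen only for illustration; they are not taken from the water tank logs.

```python
def relative_support(database, event):
    """Fraction of sequences in which the event occurs at least once."""
    hits = sum(1 for sequence in database
               if any(event in itemset for itemset in sequence))
    return hits / len(database)


def is_rare(database, event, max_support):
    """Assumed rarity test: the event occurs, but in fewer than max_support of the sequences."""
    support = relative_support(database, event)
    return 0 < support < max_support


def build_database(total, attacked):
    """Toy database: `attacked` of the `total` sequences contain a Mode_Change event."""
    normal = [{"Normal"}]
    attack = [{"Normal"}, {"Mode_Change"}]
    return [attack] * attacked + [normal] * (total - attacked)


MAX_SUPPORT = 0.3  # hypothetical threshold

single_attack = build_database(total=10, attacked=1)
flooding_attack = build_database(total=10, attacked=6)

print(is_rare(single_attack, "Mode_Change", MAX_SUPPORT))    # True: support 0.1
print(is_rare(flooding_attack, "Mode_Change", MAX_SUPPORT))  # False: support 0.6
```

Under these assumptions a single mode change stays below the threshold and is reported as rare, whereas the same change repeated across most sequences exceeds the threshold and drops out of the rare patterns.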

In the second experiment (Experiment-2), the rare sequential pattern mining algorithm with the feature reduction constraint generated 65 rare sequential patterns, significantly fewer than were generated by Experiment-1. A partial view of the results is shown in Table 4.13. The feature reduction in Experiment-2 also reduced the computational time to less than a minute, compared to the 6 days, 2 hours and 11 minutes taken by Experiment-1. Although Experiment-2 reduced both the number of rare sequential patterns and the computational time, no anomalous patterns were detected. This is because the flooding attacks made the events on the water tank control system frequent. Therefore, the proposed constrained rare sequential pattern mining algorithm could not detect the flooding attacks on the control system.

Table 4.13: A partial view of the water tank result from Experiment-2.

SID1  〈{Tank_Off_SP_Int_80}〉
SID2  〈{Tank_On_SP_Int_50}〉
SID3  〈{Tank_Off_SP_Int_80}, {Tank_On_SP_Int_50}〉
SID4  〈{Solenoid_Off_SP_Int_30}〉
SID5  〈{Tank_Read_Pump_In_Auto_-1}, {Tank_Read_Pump_Running_-1}〉
SID6  〈{Tank_Read_Pump_In_Auto_-1}, {Tank_Read_Pump_In_Auto_-1}〉
SID7  〈{Tank_Read_Pump_Running_-1}, {Tank_Read_Pump_In_Auto_-1}〉
SID8  〈{Tank_Read_Pump_Running_0}, {Tank_Read_Pump_In_Auto_-1}〉

In the third experiment (Experiment-3), the rare sequential pattern mining algorithm with algorithmic constraints generated 1 204 514 rare sequential patterns, equal to the number generated by Experiment-1, as shown in Table 4.14. This is because neither Experiment-3 nor Experiment-1 applied the feature reduction constraint. Although the two experiments generated an equal number of rare sequential patterns, their computational time differs. Experiment-3 took 4 days, 17 hours and 40 minutes, which is less than the 6 days, 2 hours and 11 minutes taken by Experiment-1. As in the previous two experiments, Experiment-1 and Experiment-2, the rare sequential patterns generated by Experiment-3 did not reveal any anomalous patterns because of the flooding attacks on the control system.

Table 4.14: A comparison of the number of rare patterns and the computational time taken by the four experiments on the water tank control system database.

Experiment (constraint, database)                        # Rare patterns    Execution time
Experiment-1 (No Constraint, Water tank SDB)             1 204 514          6 days, 2 hours, 11 minutes
Experiment-2 (Feature Constraint, Water tank SDB)        65                 < 1 minute
Experiment-3 (Algorithmic Constraint, Water tank SDB)    1 204 514          4 days, 17 hours, 40 minutes
Experiment-4 (Combined Constraint, Water tank SDB)       65                 < 1 minute

Finally, the fourth experiment (Experiment-4) on the water tank control system, in which we implemented all of the constraints together, generated the same number of rare sequential patterns as Experiment-2, as shown in Table 4.14. The reason is that both experiments used the feature reduction constraint, which reduced the number of rare sequential patterns. The computational time for Experiment-4 is less than that of Experiment-3 and Experiment-1. Although the computational time for both Experiment-4 and Experiment-2 is less than a minute, Experiment-4 took less time, only a few seconds, than Experiment-2 because of the algorithmic constraints. The difference in computational time between Experiment-4 and Experiment-2 is minimal because the feature reduction constraint had already reduced the database. If the database size had not been reduced, the algorithmic constraints could reduce the computational time significantly, as is evident from the difference between Experiment-1 and Experiment-3 shown in Table 4.14.


Like the previous three experiments, Experiment-4 could not detect any anomalous patterns because of the flooding attacks on the water tank control system. Since none of the four experiments conducted on the water tank database detected any anomalous patterns, we could not calculate the precision and the recall of the constrained rare sequential pattern mining algorithm. A comparison of the number of rare sequential patterns generated by these four experiments and their computational time on the water tank control system database is shown in Table 4.14.

4.7 Discussion

This section analyses and discusses the methods and results of the experiments. In the experiment section we have shown that different constraints need to be used with the rare sequential pattern mining algorithm to find anomalies in an effective and efficient manner. The proposed constraint-based rare sequential pattern mining algorithm (Algorithm 4.1 and Algorithm 4.2) works in two phases. In the first phase, Algorithm 4.1 generates constrained rare sequential generator patterns. In this phase it was not possible to implement the pattern existence constraint, which is one of the two algorithmic constraints. The reason is that, in the first phase of the algorithm, it is necessary to check whether a candidate sequence pattern is a rare sequential pattern or a frequent sequential pattern.

If we had implemented the pattern existence constraint in this phase, we would not have been able to find the frequent sequential patterns. The reason is that once the support value of a candidate sequence pattern meets the maximum support threshold value, the pattern existence constraint stops further scanning of the database. As a result, we could only find rare sequential patterns. However, the candidate sequence pattern could also turn out to be a frequent sequential pattern, and excluding the pattern existence constraint from Algorithm 4.1 ensured that such patterns are found. In addition, if the pattern existence constraint were used, the generation of rare sequential generator patterns would be limited. This is because the rare sequential generator patterns are generated from the frequent sequential patterns. If the frequent patterns are not generated because of the pattern existence constraint, the generation of rare sequential patterns stops. Therefore, it was necessary to check whether a candidate sequence pattern is a frequent pattern by scanning the database further, even if the support value of the candidate sequence satisfies the maximum support threshold value.

Although the pattern existence constraint could not be used with Algorithm 4.1, the other algorithmic constraint, the pattern size constraint, was used with it. The pattern size constraint reduces the computational time of generating the rare sequential generator patterns by avoiding unnecessary database scanning: database sequences whose size is smaller than the size of the candidate sequence are not scanned. The reduction in computational time depends on the number of events in the candidate sequence and in the database sequences that are skipped during the scan. The total number of database sequences that still have to be scanned for the candidate sequence also affects the reduction. The more database sequences have to be scanned, the smaller the saving in computational time; the fewer have to be scanned, the larger the saving.
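The pattern size constraint described here can be sketched in Python as follows. This is a minimal illustration, not a reproduction of Algorithm 4.1: it assumes that a sequence is a list of itemsets, that the size of a sequence is its number of itemsets, and that a candidate is contained in a sequence when its itemsets can be matched in order. The helper names `contains` and `count_support` and the toy database are hypothetical.

```python
def contains(sequence, candidate):
    """True if candidate is a subsequence of sequence, matching itemsets by inclusion."""
    position = 0
    for itemset in sequence:
        if position < len(candidate) and candidate[position] <= itemset:
            position += 1
    return position == len(candidate)


def count_support(database, candidate):
    """Count the sequences containing the candidate, skipping sequences that are too short."""
    support = 0
    scanned = 0
    for sequence in database:
        if len(sequence) < len(candidate):
            continue  # pattern size constraint: this sequence cannot contain the candidate
        scanned += 1
        if contains(sequence, candidate):
            support += 1
    return support, scanned


database = [
    [{"a"}],
    [{"a"}, {"b"}],
    [{"b"}, {"a"}, {"b"}],
    [{"c"}],
]
candidate = [{"a"}, {"b"}]
support, scanned = count_support(database, candidate)
print(support, scanned)  # 2 2
```

In this toy run only two of the four sequences are inspected, because the two size-1 sequences are skipped without being scanned.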

In the second phase, Algorithm 4.2 implemented both algorithmic constraints, pattern existence and pattern size, which reduced the computational time of generating rare sequential patterns. The pattern existence constraint avoided unnecessary scanning of the database once the support value of a candidate sequence reached the maximum support threshold value. In the second phase, the pattern existence constraint could be implemented because a candidate sequence is a rare sequential pattern if it is found in the database. Unlike in the first phase, it is not possible to find a frequent sequential pattern in the second phase, because each candidate sequence pattern is generated by extending a rare sequential pattern.

Any pattern that is extended from a rare sequential pattern and occurs in the database is itself a rare sequential pattern. Therefore, once the support of a candidate sequence pattern extended from a rare sequential pattern satisfies the maximum support threshold value, scanning the rest of the sequences in the database only costs computational time. So, the rare sequential pattern mining algorithm with the pattern existence constraint reduces the computational time. The second algorithmic constraint, the pattern size constraint, further reduces the computational time, as explained previously for the first phase of the algorithm (Algorithm 4.1). Since Algorithm 4.2 uses two algorithmic constraints compared to one in Algorithm 4.1, Algorithm 4.2 saves more computational time than Algorithm 4.1.
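A minimal Python sketch of the second phase check is given below. It is one possible reading of the pattern existence constraint, not the thesis's Algorithm 4.2: it assumes that a candidate extended from a rare pattern only needs to be found once in the database, so the scan stops at the first matching sequence. The pattern size constraint is applied in the same loop. The function names and the toy database are hypothetical.

```python
def contains(sequence, candidate):
    """True if candidate is a subsequence of sequence, matching itemsets by inclusion."""
    position = 0
    for itemset in sequence:
        if position < len(candidate) and candidate[position] <= itemset:
            position += 1
    return position == len(candidate)


def exists_in_database(database, candidate):
    """Return (found, scanned), stopping at the first sequence that contains the candidate."""
    scanned = 0
    for sequence in database:
        if len(sequence) < len(candidate):
            continue  # pattern size constraint: sequence is too short
        scanned += 1
        if contains(sequence, candidate):
            return True, scanned  # pattern existence constraint: no further scanning
    return False, scanned


database = [
    [{"a"}],
    [{"a"}, {"b"}],
    [{"b"}, {"a"}, {"b"}],
    [{"a"}, {"c"}],
]
print(exists_in_database(database, [{"a"}, {"b"}]))  # (True, 1)
print(exists_in_database(database, [{"c"}, {"a"}]))  # (False, 3)
```

The first candidate is confirmed after a single scanned sequence, while the absent candidate still requires every sufficiently long sequence to be inspected, which is why the saving varies from candidate to candidate.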

The lesson learnt is that three factors contribute to the performance of the proposed constraint-based rare sequential pattern mining algorithm: the number of unique events in a database, the size of the sequences in a database and the size of the database. If the number of unique events in a database increases, the computational time for generating rare sequential patterns also increases. This is because when a candidate sequence pattern is extended from a rare sequential pattern, each unique event is added at different positions of the rare sequential pattern. A large number of unique events creates a large number of candidate sequential patterns, which costs the algorithm a large amount of computational time.
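The growth in candidates can be made concrete with a short Python sketch. It is a simplified reading of the extension step, assuming that each unique event is inserted as a new single-event itemset at every position of an existing pattern; the function name `extend` and the event labels are hypothetical, and the thesis's own candidate generation may differ.

```python
def extend(pattern, unique_events):
    """Insert each unique event, as a new single-event itemset, at every position of the pattern."""
    candidates = set()
    for event in unique_events:
        for position in range(len(pattern) + 1):
            candidate = pattern[:position] + (frozenset([event]),) + pattern[position:]
            candidates.add(candidate)  # the set removes duplicate candidates
    return candidates


pattern = (frozenset(["a"]), frozenset(["b"]))
for n_events in (3, 10, 100):
    events = [f"e{i}" for i in range(n_events)]
    print(n_events, len(extend(pattern, events)))
```

For a size-2 pattern and events that do not already occur in it, each unique event contributes three candidates, so the candidate count in a single extension step grows in direct proportion to the number of unique events, and every candidate then has to be searched for in the database.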

Moreover, large sequences in the database increase the number of candidate sequential patterns. This is because candidate sequence patterns are extended up to the size of the longest sequence in the database. In other words, the candidate sequence is extended from a size-1 sequence to a size-2 sequence and so on, stopping at the size-n sequence, where n is the maximum size of the sequences in the database. During this extension, the unique events multiply the number of candidate sequential patterns. Since candidate sequential patterns are searched for in the database to find rare sequential patterns, a large number of candidate sequential patterns costs a large amount of computational time. Finally, a large database also costs a large amount of computational time, because the entire database is scanned for each candidate sequential pattern in order to find rare sequential patterns. It has also been learnt that the number of unique events depends on the range of values held by the control system features. If the control system features hold floating point values, the number of unique events increases significantly. This is because each control system feature is merged with its corresponding value to form an event. So, if a feature holds a range of floating point values, it creates a large number of unique events.
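The effect of floating point values on the number of unique events can be seen in a few lines of Python. The sketch assumes that an event label is formed by joining the feature name and its value with an underscore, in the style of the `Tank_Level_45` events in the result tables; the readings themselves are invented for illustration.

```python
def to_events(feature, values):
    """Merge a feature name with each of its values to form event labels."""
    return [f"{feature}_{value}" for value in values]


# Hypothetical readings of the same feature, logged as integers and as floats.
integer_readings = [45, 45, 43, 43, 45]
float_readings = [45.12, 45.37, 43.08, 43.51, 45.90]

print(len(set(to_events("Tank_Level", integer_readings))))  # 2
print(len(set(to_events("Tank_Level", float_readings))))    # 5
```

With integer values the five readings collapse into two unique events, whereas with floating point values every reading becomes its own event.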

In the three control system experiments, it has been found that implementing the constraints with the rare sequential pattern mining algorithm improved the effectiveness and efficiency of detecting anomalies. The constrained rare sequential pattern mining algorithm is effective because it generates a small number of rare sequential patterns, which requires less effort to search for anomalous patterns than a large number of rare sequential patterns. The constrained algorithm is efficient since generating rare sequential patterns takes less computational time than with the unconstrained rare sequential pattern mining algorithm. Although the constrained algorithm reduced the number of rare sequential patterns, it did not miss any anomalous pattern that was detected by the unconstrained algorithm.

The proposed constraint-based rare sequential pattern mining algorithm successfully detected anomalies on the control system. It is an improvement on the rare sequential pattern mining algorithm discussed in Chapter 3. The constraint-based algorithm detected the same anomalous patterns as the rare sequential pattern mining algorithm of Chapter 3, which means that the constraints did not compromise the detection of anomalies. At the same time, the constraint-based algorithm detected the anomalous patterns in an effective and efficient manner. The detection of anomalies is effective because anomalies are identified from a small number of rare sequential patterns, compared to the large number of rare sequential patterns generated by the unconstrained algorithm. Therefore, with the constraint-based rare sequential pattern mining algorithm, it is easier for security operators to identify the anomalous patterns. In addition, the constrained algorithm is efficient because it detects anomalies quickly: in less than a minute, compared to the several days taken by the unconstrained rare sequential pattern mining algorithm. Therefore, security operators do not need to wait a long time; rather, they can detect anomalies in less than a minute.

Although the constraint-based rare sequential pattern mining algorithm detected anomalies effectively and efficiently, it could not detect the flooding attacks on the SCADA control system. This is because the proposed algorithm, by design, finds rare sequential patterns. Since a flooding attack is conducted by repeating the same events many times, the attack changes the frequency of those events from rare to frequent. Therefore, flooding attacks cannot be detected by our proposed rare sequential pattern mining method.

4.8 Conclusion

Anomaly detection using the constraint-based rare sequential pattern mining algorithm successfully detected attack patterns from the SCADA control system logs. Compared to the rare sequential pattern mining algorithm without constraints, the constraint-based algorithm has been found to be effective and efficient in detecting anomalies on a SCADA control system. It is effective because it discards the less important rare sequential patterns, which reduces the number of rare sequential patterns. As a result, less effort is required to find anomalies among the smaller number of rare sequential patterns. The constraint-based algorithm is efficient because it takes less computational time to identify the anomalies. Therefore, the proposed constraint-based rare sequential pattern mining algorithm is promising, since in some cases it takes months, and sometimes even a year, to detect a cyber incident in a control system after the incident occurs [130].

We have analysed and demonstrated that implementing constraints in rare sequential pattern mining can be effective and efficient, by focusing only on the patterns of interest and reducing the computational time for detecting anomalies. We validated our constraint-based rare sequential pattern mining results with the labelled SCADA attack dataset. In this experiment we have shown that our proposed constraint-based rare sequential pattern mining algorithm can be used to detect anomalies effectively and efficiently on a SCADA control system.

The anomaly detection presented in this chapter and in the previous chapter, Chapter 3, is based on experiments with off-line SCADA control system logs. Off-line means that the SCADA process control events are stored in log files. These logs are then collected and pre-processed to prepare the sequential databases. Our proposed rare sequential pattern mining algorithm generates rare sequential patterns, from which anomalous patterns are detected. The proposed method can only detect anomalies that have already occurred on the SCADA control system; it cannot predict anomalies before they occur. In the next chapter, Chapter 5, we show how possible anomalies on the SCADA control system can be predicted by using on-line or streaming SCADA control system logs.

Chapter 5

A Rare Sequential Association Rules Mining of SCADA Streaming Logs for Anomaly Prediction

5.1 Introduction

Sequential association rules mining refers to discovering rules in a sequential database. Every individual rule consists of several sequential events. These events are divided into two parts, the antecedent and the consequent. In many application domains, sequential association rule mining has been applied to analyse data and predict future events. For example, in stock market analysis, e-learning, and drought management [131], sequential rules have a high prediction accuracy compared to sequential patterns. This is because sequential rules predict the possible occurrence of future events based on current events [132]. To find sequential rules, several algorithms have been developed in the literature. Mannila et al. [48] find sequential rules by analysing alarm flow in telecommunication network logs. If two sets of events (also called episodes) occur frequently in a sequence, a rule can be generated such as X ⇒ Y, where X and Y represent two sets of events in a sequence.

The rule indicates that if an event or set of events X occurs, it is likely that another event or set of events Y will also occur after some time. The probability of their occurrence can be indicated with a confidence value that can be computed as support(X ∪ Y) / support(X). The rule is only valid if the confidence value of the rule satisfies the user-provided minimum confidence value minconf. Harms et al. [133] find rules if the frequent antecedent X is followed by the frequent consequent Y in several sequences of a sequential database. Lo et al. [134] find rules that are common to several sequences. These rules strictly maintain the order of events inside the antecedent X and the consequent Y, and also the order between X and Y. However, Fournier-Viger et al. [135] proposed sequential rules wherein the order of the events inside the antecedent X and the consequent Y is not considered; the order is only maintained between the antecedent X and the consequent Y. In other words, the antecedent part will be followed by the consequent part.

The existing algorithms for sequential association rules mining in the literature discover rules using frequent patterns. However, no existing work generates rules from rare sequential patterns, although these patterns can be used for detecting anomalies and attacks, as this research has shown in Chapter 3 and Chapter 4 of this thesis. Note that, when sequential association rules are generated from rare sequential patterns, the antecedent X and the consequent Y do not always need to be rare. Rather, X and Y can be frequent, or a combination of rare and frequent patterns, unlike in frequent sequential rules where both X and Y must be frequent. In sequential association rule mining, the antecedent could be considered a precursor to a likely incoming or ongoing anomalous or attack pattern in a streaming log. The precursor can be considered the unusual initiating events connected to the consequent of the association rule.

The goal of sequential association rule mining in this research is to discover rules from rare sequential patterns so that these rules can be used to predict and detect anomalies in the incoming streaming logs. For example, if the rare sequential pattern 〈{a}, {b}, {c}, {d}〉 is detected or identified as an anomalous or attack pattern, the following sequential association rules can be generated from it: 〈{a}〉 ⇒ 〈{b}, {c}, {d}〉; 〈{a}, {b}〉 ⇒ 〈{c}, {d}〉; 〈{a}, {b}, {c}〉 ⇒ 〈{d}〉. Note that every rare sequential pattern comprising N events (N ≥ 2) generates a maximum of (N - 1) rules. Not all of these rules can be used to predict anomalies in the incoming streaming logs. Only the valid rules, those with a confidence value higher than the user-provided minimum confidence value minconf, can be used for prediction.
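The rule generation described in this example can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the algorithm proposed in this chapter: the confidence is taken as the support of the whole pattern divided by the support of the antecedent, a rule is kept when its confidence is at least minconf, and the helper names and the toy database are hypothetical.

```python
def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence, matching itemsets by inclusion."""
    position = 0
    for itemset in sequence:
        if position < len(pattern) and pattern[position] <= itemset:
            position += 1
    return position == len(pattern)


def support(database, pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for sequence in database if contains(sequence, pattern))


def generate_rules(database, pattern, minconf):
    """Split the pattern at every position and keep rules whose confidence reaches minconf."""
    rules = []
    pattern_support = support(database, pattern)
    for split in range(1, len(pattern)):  # at most N - 1 splits for N events
        antecedent, consequent = pattern[:split], pattern[split:]
        antecedent_support = support(database, antecedent)
        if antecedent_support == 0:
            continue  # the antecedent never occurs, so no confidence can be computed
        confidence = pattern_support / antecedent_support
        if confidence >= minconf:  # assumed validity test
            rules.append((antecedent, consequent, confidence))
    return rules


database = [
    [{"a"}, {"b"}, {"c"}, {"d"}],
    [{"a"}, {"x"}],
    [{"a"}, {"b"}, {"x"}],
    [{"x"}, {"y"}],
]
pattern = [{"a"}, {"b"}, {"c"}, {"d"}]

for antecedent, consequent, confidence in generate_rules(database, pattern, minconf=0.5):
    print(len(antecedent), len(consequent), round(confidence, 2))
```

In this toy database the size-4 pattern yields three possible rules with confidences 1/3, 1/2 and 1, so two of them survive a minconf of 0.5. A size-1 pattern produces no split and therefore no rule.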


5.1.1 Motivation

We have shown in Chapter 3 and Chapter 4 of this thesis that rare sequential patterns can be used to detect anomalies and attacks in SCADA control system logs. A sequential rule is important because it gives a better understanding of the data: it provides a good correlation between two segments of data [136]. There is an increasing demand from industry not only to detect anomalies in a system, but also to predict them. Sequential association rules can be effective in predicting a possible incoming anomalous pattern. If the antecedent is contained in the incoming logs, the consequent is likely to occur soon. Since the rules are generated from rare sequential patterns, the probability of anomaly prediction is high once the antecedent is found.

However, we also argue that it is not always possible to predict an incoming anomalous pattern from incoming streaming logs. The reason is that, in some application domains, it is possible to attack the system by changing only the set value of a single event. For example, during the operation of a pressure control system, the predefined lower and upper pressure threshold values remain unchanged for the entire experiment. But it is possible to disrupt the expected output of the pressure control system by changing the predefined threshold values. If the predefined upper threshold value of 40 PSI is changed to 60 PSI, the object that undergoes the pressure control process could come out faulty. In this case, the pressure control feature merged with the changed threshold value forms a single event, or size-1 pattern. The size-1 pattern can be found as a rare sequential pattern on the pressure control system. Because it contains only one event, no sequential association rule can be generated from this rare sequential pattern. When this single-event rare sequential pattern occurs on the system, it indicates an anomaly. A pattern with a single event can thus be used to indicate an anomaly, but cannot be used to predict one. The reason is that a single-event or size-1 rare pattern is composed of one segment, unlike a sequential association rule, which is composed of two segments: the antecedent and the consequent. Therefore, a single-event rare sequential pattern cannot be used to predict a possible incoming anomaly; rather, it is used to detect anomalies that have occurred on a system.

There are two contributions in this chapter. Firstly, this is the first work that uses sequential association rules to predict anomalies in streamed SCADA control system logs. Secondly, even if in some cases the proposed method cannot predict the incoming attack, it can detect the attack when it happens in the system. Hence, this method can also ensure there are no false negatives for rare anomalies in the system. This is because the discovered rare sequential patterns can detect all of the possible rare anomalies in a system. To validate our proposed method, we used SCADA control system logs for possible anomaly predictions.

5.2 Previous Work

Sequential association rules can be discovered from a single sequence or from multiple sequences [137] [67]. Mannila et al. [48] find sequential rules by analysing alarm flow in telecommunication networks. They discover rules from a single sequence. If two sets of events (or episodes) occur frequently in a sequence, an association rule can be generated as X ⇒ Y, where X and Y represent two sets of events in a sequence. The rule indicates that if an event or set of events X occurs, it is likely that another event or set of events Y will occur within a defined sliding window. The probability of their occurrence can be indicated with a confidence value: the higher the confidence value of an association rule, the more likely it is that the consequent will occur after the antecedent has occurred. Harms et al. [133] find rules if a frequent antecedent X is followed by a frequent consequent Y in several sequences of a sequential database within a user-defined time window. Lo et al. [134] find rules that are common to several sequences. In all these works, the order of the events in both the antecedent and the consequent is preserved. However, Fournier-Viger et al. [135] generate partially ordered sequential association rules from multiple sequences. In this method, the order of the events inside the antecedent and the consequent is not considered. The order is only maintained between the antecedent and the consequent of a rule.

Sequential association rules can be used in different applications, such as weather forecasting, disease identification and financial investment risk. In the medical sector, they can be used to predict a disease from a sequence of symptoms. In the capital market, the probable investment risk can be predicted from a sequence of stock market events using sequential association rules [121] [138]. The existing algorithms in the literature discover sequential association rules from frequent sequential patterns; in these rules both the antecedent and the consequent are frequent. However, no existing algorithm discovers association rules from rare sequential patterns, although rare sequential patterns can be used to detect anomalies in a system. The anomaly detection procedures for rare sequential patterns have been discussed in Chapter 3 and Chapter 4 of this thesis. In these two chapters it is shown that the rare sequential pattern mining algorithm can generate rare sequential patterns from static (stored) SCADA control logs. These rare patterns can effectively detect anomalies that have occurred on the SCADA system. However, they cannot be used to predict incoming anomalies on a live SCADA system. This is because no correlation is established among the events in a rare sequential pattern on which anomaly prediction could be based.

To address this limitation of the rare sequential pattern mining algorithm, in this chapter we propose a two-phase anomaly prediction algorithm. The goal of this algorithm is to discover sequential rules that appear in a rare sequence. In the first phase, Algorithm 5.1 generates sequential association rules from the rare sequential patterns. Algorithm 5.1 generates only the valid association rules, that is, those that satisfy the user-provided minimum confidence value. In the second phase, Algorithm 5.2 uses the valid sequential association rules generated by Algorithm 5.1 to predict possible incoming anomalous patterns from SCADA streaming control logs. A prediction of an anomalous pattern is raised by triggering a rule once the antecedent of that rule is found in the streaming logs.

5.3 Preliminaries

This section presents the background knowledge of sequential association rules. This research generates sequential association rules from rare patterns; each rule captures a correlation among the events in a sequence. A sequential association rule X ⇒ Y represents an association or correlation between the two sequences X and Y. The rule is validated by the support and confidence parameters. The support indicates the frequency of the sequence 〈X, Y〉 in a sequential database, while the confidence defines the degree of correlation between X and Y, in other words, how likely it is that Y will occur after X has occurred. The support and confidence are defined as:

• Support: $S(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{N}$, where N is the total number of transactions (sequences) in the database and σ(X ∪ Y) is the number of transactions that contain both X and Y. The support should be less than or equal to the user-provided maximum support threshold maxsup, that is, support ≤ maxsup. For example, the sequence 〈{a}, {c}, {e}〉 in Table 5.1 appears twice, in SID3 and SID5, in the sequential database SDB. The total number of sequences in the SDB is 6. Therefore, the relative support of this sequence is 2/6 ≈ 33%, while its absolute support count is 2.

Table 5.1: A sequential database SDB.

Sequence ID    Sequences
SID1           〈{a}, {b, d}, {e}, {c}〉
SID2           〈{a}, {c}, {b, c}, {a}〉
SID3           〈{a, b}, {b, c}, {c, d}, {e}〉
SID4           〈{b}, {c, e}, {e}〉
SID5           〈{a}, {b, d}, {e}, {f}, {d}〉
SID6           〈{g}〉

• Confidence: $C(X \Rightarrow Y) = \frac{S(X \cup Y)}{S(X)}$, where S(X ∪ Y) is the support of the pattern formed by the antecedent X followed by the consequent Y of a sequential association rule, and S(X) is the support of the antecedent X. The confidence should be greater than or equal to the minconf threshold value, that is, confidence ≥ minconf. For example, consider the sequential association rules R1 = 〈{a}〉 ⇒ 〈{c}, {e}〉 and R2 = 〈{a}, {c}〉 ⇒ 〈{e}〉 generated from the rare sequential pattern 〈{a}, {c}, {e}〉. The confidence of the rule R1 is 2/4 = 0.5, because the support of the rare pattern S(X ∪ Y), that is, 〈{a}, {c}, {e}〉, is 2 and the support of the antecedent S(X), that is, 〈{a}〉, is 4. On the other hand, the confidence of the rule R2 is 2/3 ≈ 0.67, because the support of the rare pattern 〈{a}, {c}, {e}〉 is 2 and the support of the antecedent 〈{a}, {c}〉 is 3. Note that, even though the support of the rare pattern is the same for the rules R1 and R2, their confidence values are not the same.
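The support and confidence definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the thesis implementation: the names `contains`, `support` and `confidence` are assumptions introduced here, sequences are modelled as lists of sets, and containment is taken to mean an order-preserving match in which each pattern itemset is a subset of some later itemset of the sequence. The antecedent supports are counted from Table 5.1, while the pattern support of 2 is taken from the text as given.

```python
# Table 5.1, transcribed as lists of itemsets (one list per sequence ID).
SDB = {
    "SID1": [{"a"}, {"b", "d"}, {"e"}, {"c"}],
    "SID2": [{"a"}, {"c"}, {"b", "c"}, {"a"}],
    "SID3": [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"e"}],
    "SID4": [{"b"}, {"c", "e"}, {"e"}],
    "SID5": [{"a"}, {"b", "d"}, {"e"}, {"f"}, {"d"}],
    "SID6": [{"g"}],
}


def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order (itemset-wise subset)."""
    pos = 0
    for itemset in pattern:
        # Advance to the next itemset of the sequence that covers this one.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(sdb, pattern):
    """Absolute support: number of sequences that contain the pattern."""
    return sum(contains(seq, pattern) for seq in sdb.values())


def confidence(sup_pattern, sup_antecedent):
    """Confidence of X => Y given S(X u Y) and S(X) as absolute counts."""
    return sup_pattern / sup_antecedent


sup_a = support(SDB, [{"a"}])            # antecedent of R1
sup_ac = support(SDB, [{"a"}, {"c"}])    # antecedent of R2
conf_r1 = confidence(2, sup_a)           # pattern support 2 as stated in the text
conf_r2 = confidence(2, sup_ac)
print(sup_a, sup_ac, conf_r1, round(conf_r2, 2))
```

Dividing an absolute support by `len(SDB)` gives the relative support used in the maxsup comparison.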

Sequential association rules are generated from rare sequential patterns. The set of association rules generated from the rare sequential patterns can be defined as follows:

R = {X ⇒ Y | X, Y are frequent or rare sequential patterns, X ∪ Y is a rare sequential pattern, Sup(X ⇒ Y) ≤ maxsup, Conf(X ⇒ Y) ≥ minconf}



These association rules can be generated from rare sequential patterns that have at least two events. If a rare sequential pattern is composed of a single event, no sequential association rule can be generated; in that case, the single-event rare sequential pattern can only be used to detect anomalies. For example, the sequence 〈{a}, {c}, {e}〉 in the example dataset in Table 5.1 is a rare sequential pattern composed of three events. This rare sequential pattern can produce at most two sequential association rules: the first is 〈{a}〉 ⇒ 〈{c}, {e}〉 and the second is 〈{a}, {c}〉 ⇒ 〈{e}〉. In general association rule generation, many orderings of the events are possible in both the antecedent and the consequent. The consequent of the first rule 〈{a}〉 ⇒ 〈{c}, {e}〉 could be arranged either as 〈{c}, {e}〉 or as 〈{e}, {c}〉. Likewise, the antecedent of the second rule 〈{a}, {c}〉 ⇒ 〈{e}〉 could be arranged either as 〈{a}, {c}〉 or as 〈{c}, {a}〉. Therefore, if all possible orderings of events in both the antecedent and the consequent were considered, the rare sequential pattern 〈{a}, {c}, {e}〉 could generate four sequential association rules: 〈{a}〉 ⇒ 〈{c}, {e}〉; 〈{a}〉 ⇒ 〈{e}, {c}〉; 〈{a}, {c}〉 ⇒ 〈{e}〉; 〈{c}, {a}〉 ⇒ 〈{e}〉.

However, because sequential association rules are generated from rare sequential patterns, orderings of events other than the one that appears in the rare sequential pattern are not possible. This is because the events in a rare sequential pattern occur sequentially. Therefore, the rare sequential pattern 〈{a}, {c}, {e}〉 generates at most two sequential association rules instead of four. On the other hand, the sequence 〈{g}〉 in SID6 of the example dataset in Table 5.1 is also a rare sequential pattern when the maximum support threshold is set to 2. Since this pattern is composed of a single event, that is, it is a size-1 rare sequential pattern, no sequential association rule can be generated from it. Instead, the size-1 rare sequential pattern 〈{g}〉 is used to detect an anomaly rather than to predict one. Sequential association rules can be used to predict and detect possible anomalous patterns in incoming streaming logs.

To predict possible incoming anomalies, the streaming logs are segmented into a sequence of time windows or sessions, denoted as 〈W1, W2, ..., Wn−1, Wn〉 and shown in Figure 5.1. The session time for each window Wi is determined by the user. In other words, the streaming events are segmented into windows according to the session time set by the user.

Figure 5.1: Anomaly prediction from streaming logs.

The figure shows that the incoming streaming log data is segmented into a sequence of time windows, starting with the window W1 and continuing until the streaming logs end. Each window contains a sequence Si comprising some number of ordered events e1, e2, ..., eni. In other words, a sequence Si = 〈e1, e2, ..., eni〉 is created from each window Wi. If the antecedent of any sequential association rule is contained in the sequence Si, it can be predicted that the consequent of the rule may occur after some time.
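The segmentation of a log stream into fixed-time session windows can be sketched as follows. This is an illustrative assumption about the data layout, not the thesis code: events are modelled as hypothetical `(timestamp_in_seconds, itemset)` pairs, and the function name `segment_into_windows` and the sample timestamps are invented for the example.

```python
def segment_into_windows(events, session_seconds):
    """Group (timestamp, itemset) pairs into fixed-length time windows.

    Returns a list of sequences S_1..S_n, one per window. Empty windows are
    kept as empty lists so that window indices stay aligned with time.
    """
    if not events:
        return []
    events = sorted(events, key=lambda e: e[0])
    start = events[0][0]
    n_windows = int((events[-1][0] - start) // session_seconds) + 1
    windows = [[] for _ in range(n_windows)]
    for timestamp, itemset in events:
        index = int((timestamp - start) // session_seconds)
        windows[index].append(itemset)
    return windows


# Hypothetical timestamps; the session length (10 s) is set by the user.
log = [
    (0, {"a"}), (3, {"b", "c"}), (7, {"d"}), (9, {"f"}),
    (11, {"c"}), (14, {"b", "d"}), (18, {"a"}),
]
windows = segment_into_windows(log, session_seconds=10)
for i, seq in enumerate(windows, start=1):
    print(f"W{i}: {len(seq)} events")
```

Both windows cover the same duration, yet they hold different numbers of events, which is the situation the text describes for the sequences Si.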

5.4 A New Anomaly Prediction Method Using Sequential Association Rules

Although the rare sequential pattern mining method can detect anomalies, it cannot predict incoming anomalies from streaming logs. If upcoming anomalies can be predicted ahead of their occurrence on a system, the system operator can be alerted so that the necessary precautions can be taken. Therefore, there is a need to develop an anomaly prediction system. To achieve this, we propose a new method for predicting potential anomalies, which is composed of two phases. In the first phase, our proposed algorithm automatically generates sequential association rules from the rare sequential patterns. In the second phase, these sequential association rules are used to predict and detect possible anomalies in the streaming logs.



In order to generate sequential association rules, we need rare sequential patterns. Only rare sequential patterns that have at least two events, that is, patterns of size-2 or more, are used to generate sequential association rules. This is because a size-1 rare sequential pattern is composed of a single event, which cannot be separated into the antecedent and the consequent of an association rule. Moreover, while generating association rules we generated only the longest antecedent association rules instead of all variable length association rules. Generally, n-1 association rules can be generated from a rare sequential pattern of size-n. For example, the rare sequential pattern 〈{a}, {b}, {c}, {d}〉 of size-4 can generate the following 3 sequential association rules:

(i) 〈{a}〉 ⇒ 〈{b}, {c}, {d}〉

(ii) 〈{a}, {b}〉 ⇒ 〈{c}, {d}〉

(iii) 〈{a}, {b}, {c}〉 ⇒ 〈{d}〉

Among these three association rules, the 3rd association rule 〈{a}, {b}, {c}〉 ⇒ 〈{d}〉 is the longest antecedent association rule. The antecedents of the 1st and 2nd association rules are subsequences of the antecedent of the 3rd association rule. The three rules are redundant because each of them covers the same sequence of events. If all three rules are used to predict the possible anomalous pattern 〈{a}, {b}, {c}, {d}〉, three predictions can occur; in other words, three rules are triggered for one possible anomalous pattern. However, if only the longest antecedent rule is used, a single prediction occurs instead of three. Therefore, to reduce the number of anomaly predictions and to avoid triggering redundant rules, in this experiment we generated only the longest antecedent rules rather than variable length rules.
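The two ways of splitting a rare sequential pattern into rules can be sketched as below. The function names `split_rules` and `longest_antecedent_rule` are assumptions made for this illustration; a pattern is modelled as a list of itemsets, and its size is the number of events in the list.

```python
def split_rules(pattern):
    """All n-1 order-preserving splits of a size-n pattern.

    Each rule is an (antecedent, consequent) pair of non-empty sub-patterns.
    """
    return [(pattern[:i], pattern[i:]) for i in range(1, len(pattern))]


def longest_antecedent_rule(pattern):
    """The single split whose antecedent has size n-1.

    Returns None for a size-1 pattern, which cannot form a rule.
    """
    if len(pattern) < 2:
        return None
    return pattern[:-1], pattern[-1:]


P = [{"a"}, {"b"}, {"c"}, {"d"}]
rules = split_rules(P)            # rules (i), (ii) and (iii) of the text
longest = longest_antecedent_rule(P)
for ante, cons in rules:
    print(ante, "=>", cons)
```

A size-1 pattern such as 〈{g}〉 yields an empty list from `split_rules` and `None` from `longest_antecedent_rule`, so it can only serve for detection.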

The rest of this section is organised as follows: Section 5.4.1 describes generating sequential association rules for the experiment, Section 5.4.2 presents the prediction of anomalies using the sequential association rules, and finally Section 5.4.2.1 describes the prediction methods.

5.4.1 Generating Sequential Association Rules

To generate association rules, we first generate rare sequential patterns from a sequential database SDB by applying our proposed rare sequential pattern mining algorithms (Algorithm 3.1 and Algorithm 3.2), which are presented in Chapter 3. Algorithm 5.1 then generates sequential association rules from these rare sequential patterns. The inputs to this algorithm are a sequential database SDB, a set of rare sequential patterns RSP and a user-provided minimum confidence value minconf. The variable SAR in step 1 of Algorithm 5.1 holds the set of sequential association rules, which is the output of the algorithm. In steps 2-13, the algorithm generates sequential association rules until no rare sequential patterns remain in RSP. For example, the sequence 〈{a}, {b, d}, {e}, {f}, {d}〉 in SID5 of the example dataset given in Table 5.1 is a rare sequential pattern P. Before sequential association rules are generated from P, the pattern needs to be checked to determine whether it is a single-event (size-1) rare sequential pattern, because no sequential association rule can be generated from such a pattern. This check is done at step 3 of Algorithm 5.1.

In the above example, the rare sequential pattern P is not a single-event pattern; it is composed of five events. Therefore, Algorithm 5.1 generates a sequential association rule from P in steps 4-12. The largest antecedent, which is the sub-pattern of size-(n-1) of the rare sequential pattern P, is assigned to the variable ante, that is, ante = 〈{a}, {b, d}, {e}, {f}〉, as shown at step 4 of Algorithm 5.1. The consequent is formed by removing the sub-pattern 〈{a}, {b, d}, {e}, {f}〉 from P, so the consequent 〈{d}〉 is assigned to the variable cons, as shown at step 5 of Algorithm 5.1. These two variables together form a sequential association rule, ante ⇒ cons. The association rule 〈{a}, {b, d}, {e}, {f}〉 ⇒ 〈{d}〉 is therefore the only rule generated from the rare sequential pattern P.



Algorithm 5.1: Generating Sequential Association Rules.
Input: a sequential database SDB, a set of rare sequential patterns RSP and their supports, and a minimum confidence minconf
Output: a set of generated sequential association rules

 1  SAR ← {}                      // set of sequential association rules from the SDB
 2  for P ∈ RSP do
 3      if |P| > 1 then
 4          ante ← sp(i, P)       // sp(i, P) is a pattern containing the left-most i events, where i ← |P| − 1
 5          cons ← P \ sp(i, P)   // remove sp(i, P); cons is the consequent of a rule
 6          count the support Sup(ante) by scanning SDB
 7          conf ← Sup(P) / Sup(ante)   // support of P is divided by the support of its antecedent
 8          if conf ≥ minconf then
 9              SAR ← SAR ∪ {(ante ⇒ cons, conf)}
10          else
11              continue
12      else
13          continue
14  return SAR

Once an association rule ante ⇒ cons is generated, Algorithm 5.1 finds the support value of the antecedent ante by scanning the database, as shown at step 6. The confidence of the association rule is then calculated by dividing the support value of the rare sequential pattern P by the support value of its antecedent, which is done at step 7. A generated association rule is valid only when its confidence value is greater than or equal to the user-defined minimum confidence value minconf, as shown at step 8. A valid association rule signifies that if the antecedent of the rule occurs in the streaming logs, then it is highly likely that the consequent of the same rule will occur next in the streaming logs. The valid association rules are added to the set of sequential association rules SAR, as shown at step 9. Algorithm 5.1 stops generating association rules when no rare sequential patterns remain in RSP. The valid association rules SAR generated in the first phase by Algorithm 5.1 are used to predict and detect anomalies using Algorithm 5.2 in the second phase, which is discussed next in Section 5.4.2.
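Algorithm 5.1 can be sketched in Python roughly as follows. This is one possible reading of the pseudocode under stated assumptions, not the thesis implementation: the helper names `contains`, `support` and `generate_rules` are invented here, patterns are lists of itemsets, and containment is the order-preserving subset match. The database is Table 5.1, and the pattern supports are counted from it instead of being supplied by a mining step.

```python
SDB = [
    [{"a"}, {"b", "d"}, {"e"}, {"c"}],
    [{"a"}, {"c"}, {"b", "c"}, {"a"}],
    [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"e"}],
    [{"b"}, {"c", "e"}, {"e"}],
    [{"a"}, {"b", "d"}, {"e"}, {"f"}, {"d"}],
    [{"g"}],
]


def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order (itemset-wise subset)."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(sdb, pattern):
    """Absolute support: one database scan counting containing sequences."""
    return sum(contains(seq, pattern) for seq in sdb)


def generate_rules(sdb, rare_patterns, minconf):
    """rare_patterns: (pattern, support) pairs mined from `sdb`.

    Returns (antecedent, consequent, confidence) triples for valid rules.
    """
    rules = []
    for pattern, sup_pattern in rare_patterns:
        if len(pattern) < 2:                      # step 3: size-1, no rule
            continue
        ante, cons = pattern[:-1], pattern[-1:]   # steps 4-5: longest antecedent
        sup_ante = support(sdb, ante)             # step 6 (>= sup_pattern >= 1)
        conf = sup_pattern / sup_ante             # step 7
        if conf >= minconf:                       # step 8
            rules.append((ante, cons, conf))      # step 9
    return rules


patterns = [[{"a"}, {"b", "d"}, {"e"}, {"f"}, {"d"}], [{"g"}]]
rare = [(p, support(SDB, p)) for p in patterns]
rules = generate_rules(SDB, rare, minconf=0.5)
print(rules)
```

Because only the longest antecedent is split off, each pattern costs a single support scan, which is the saving discussed later for the longest antecedent method.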

5.4.2 Prediction of Anomalies using Sequential Association Rules

In the second phase of our proposed method, Algorithm 5.2 (anomaly prediction and detection using sequential association rules) predicts and detects anomalies in the streaming logs. In this process, the logs are checked against the valid sequential association rules generated by Algorithm 5.1. First, the incoming streaming logs are segmented into a sequence of fixed-time session windows 〈W1, W2, ..., Wn〉, as shown in Figure 5.1. Each session window Wi yields a sequence Si containing a sequence of events 〈e1, e2, ..., eni〉. The number of events can vary from window to window, since it depends on how many events occur during the fixed session period. For example, assume that a sequence S1 comprises four events {a}, {b, c}, {d}, {f} that occur in the session window W1 of the streaming log, while the next sequence S2 comprises three events {c}, {b, d}, {a} that occur in the window W2.

Even though the fixed time duration of the two windows W1 and W2 is the same, the number of events in the two sequences S1 and S2 can be different. In the prediction and detection process, Algorithm 5.2 takes a set of sequential association rules rSAR as input. The sequential association rules are stored in the variable TR, as shown in step 1 of Algorithm 5.2. The algorithm also takes as input a sequence of windows 〈W1, W2, ..., Wn〉 segmented from the streaming logs. The algorithm starts with the window Wi, where i starts from 1, and continues as long as the streamed logs exist. Next, the sequence Si comprising the events from window Wi is generated in step 6. The sequence Si is then checked against each association rule in TR. For each rule r = X ⇒ Y, it is checked whether the antecedent X is contained in the sequence Si. If X ⊑ Si, it is predicted that the consequent Y of the association rule X ⇒ Y will occur next in the streaming logs, as shown in steps 8-10. The rule X ⇒ Y is therefore triggered and a prediction alert is recorded in TRP for the likely incoming event or sequence of events of the consequent Y. The consequent may occur in the same window Wi in which the antecedent occurred, or in one of the upcoming windows Wi+j. If the consequent Y occurs within the defined time period, the anomaly prediction alert is correct, and the triggered rule is recorded in TRD as detected, as shown in steps 11-13. However, if the consequent Y does not occur within the defined time period after the antecedent X has occurred, the prediction alert is removed from TRP, as shown in step 15. In that case, the anomaly prediction made by the triggered rule has been found to be false.

Algorithm 5.2: Anomaly Prediction using Sequential Association Rules.
Input: a set of sequential association rules rSAR, a sequence of fixed time session windows 〈W1, W2, ..., Wn〉 from a streaming log
Output: display attack patterns

 1  TR ← rSAR     // a set of sequential association rules rSAR
 2  TRP ← {}      // a set of triggered rules of possible predicted anomalies
 3  TRD ← {}      // a set of triggered rules of possible predicted anomalies followed by detection
 4  k ← 2         // k is the maximum number of windows for the consequent of a rule to occur
 5  for i ← 1 to n do
 6      Si ← 〈e1, ..., eni〉 is the sequence of the ith window Wi
 7      for r ∈ TR, r = X ⇒ Y do
 8          if antecedent X ⊑ Si then
 9              display consequent Y of r indicating Y may happen soon
10              TRP ← TRP ∪ {(Wi, (X ⇒ Y, conf))}
11              for j ← 1 to k do
12                  if Y ⊑ Si+j then
13                      TRD ← TRD ∪ {(Wi+j, (X ⇒ Y, conf))}
14                  else
15                      remove (Wi, (X ⇒ Y, conf)) from TRP
16          else
17              continue
18  return TRD

For example, let the association rule be 〈{a}, {b, d}〉 ⇒ 〈{d}〉, and let it be checked against a sequence 〈{a}, {c}, {b, d}, {b}〉, denoted as Si, that is generated from the window Wi of a streaming log. Since the antecedent 〈{a}, {b, d}〉 of the rule is contained in Si, it is predicted that the consequent 〈{d}〉 will occur soon in the streaming log. The consequent may occur in the same window Wi after the antecedent has occurred, or it may occur in the sequence Si+1 of the window Wi+1 or in the sequence Si+2 of the window Wi+2. The windows Wi+1 and Wi+2 are within the defined time period during which the rule's consequent remains valid once the antecedent has occurred. If the sequence Si+1 is 〈{e}, {a}, {f}, {c}〉 and the sequence Si+2 is 〈{b}, {c}, {e}, {a}, {f}, {d}〉, then the consequent is found in the sequence Si+2 after the antecedent has occurred in Si. The anomaly prediction alert is therefore correct, and the triggered rule, its confidence and the window number in which the consequent is found are stored in the variable TRD, as shown in step 13 of Algorithm 5.2. However, if the consequent is found in neither Wi+1 nor Wi+2 although the antecedent has been found in Wi, the prediction is discarded, because the consequent did not occur in the streaming logs within the defined time period.
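The worked example above can be replayed with a short Python sketch of the prediction and detection loop. It is an interpretation of the described behaviour rather than a transcription of Algorithm 5.2: a prediction is discarded only after the consequent is missing from all k following windows, the same window counts only when the consequent follows the antecedent, and the windows are replayed from a list instead of a live stream. The names `contains` and `predict_and_detect` are assumptions, and rule confidences are omitted because the example does not state them.

```python
def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order (itemset-wise subset)."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def predict_and_detect(rules, windows, k=2):
    """rules: list of (antecedent, consequent) pairs.

    Returns (predictions, detections) as lists of (window_index, rule_index).
    """
    predictions, detections = [], []
    for i, seq in enumerate(windows):
        for r, (ante, cons) in enumerate(rules):
            if not contains(seq, ante):
                continue
            predictions.append((i, r))          # alert: consequent may follow
            if contains(seq, ante + cons):      # consequent later in same window
                detections.append((i, r))
                continue
            for j in range(1, k + 1):           # look ahead at most k windows
                if i + j < len(windows) and contains(windows[i + j], cons):
                    detections.append((i + j, r))
                    break
            else:
                predictions.remove((i, r))      # consequent never arrived
    return predictions, detections


rules = [([{"a"}, {"b", "d"}], [{"d"}])]
windows = [
    [{"a"}, {"c"}, {"b", "d"}, {"b"}],                   # S_i
    [{"e"}, {"a"}, {"f"}, {"c"}],                        # S_i+1
    [{"b"}, {"c"}, {"e"}, {"a"}, {"f"}, {"d"}],          # S_i+2
]
predictions, detections = predict_and_detect(rules, windows, k=2)
print(predictions, detections)
```

With these three windows the rule is triggered in the first window and confirmed two windows later, matching the outcome described in the text.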

5.4.2.1 Prediction Methods

The prediction that the consequent of a rule will occur after its antecedent is found in the streaming log can be made using two different methods. The first method, called the variable length antecedent rule method, raises an alert after finding any antecedent X, from the set of sequential association rules, that is contained in the sequence Si of a window Wi of the streaming log. The second method, called the longest antecedent rule method, finds the largest antecedent among the association rules that is contained in the sequence Si of a window Wi and then raises a possible anomaly alert. For example, assume an incoming log sequence Si = 〈{a}, {b, d}, {e}, {b}, {f}, {a}, {d}〉 in a window Wi. The association rules shown in Table 5.2 are then checked against the sequence Si to predict a possible incoming anomalous pattern.

5.4.2.2 Variable Length Antecedent Rules

In this method, the antecedent 〈{a}, {b, d}〉 of the rule R2 in Table 5.2 is contained in the incoming log sequence Si, so the algorithm raises a possible anomaly alert on the system. The alert indicates that the consequent 〈{e}, {f}, {d}〉 of the rule R2 is likely to occur on the system. The consequent may occur in the same sequence in which the antecedent is found, or it may occur in the following windows that fall within the defined time period. If the consequent is found in the same sequence Si, following the antecedent, the prediction is immediately found to be true, and it is considered that a possible anomaly has happened on the system. If the consequent is not found in the same sequence Si, it is predicted that the consequent may occur in the following windows that are within the defined time period. If the consequent is found in those windows, the anomaly prediction becomes true, meaning that the predicted anomaly has been detected in the streaming logs. Otherwise, the anomaly prediction alert is removed from the prediction system. For the association rule R2, the prediction is found to be true in the same sequence Si in which the antecedent is found.

Similarly, since the antecedent 〈{a}, {b, d}, {e}〉 of the rule R3 is contained in the sequence Si, it is predicted that the consequent sequence of events 〈{f}, {d}〉 is likely to occur on the system after some time. Since the consequent is found in the same sequence Si, the prediction becomes true. Finally, the antecedent 〈{a}, {b, d}, {e}, {f}〉 of the rule R4 is also found in the sequence Si, and its consequent 〈{d}〉 is found in the same sequence, so this prediction is also true. Note that both the antecedent and the consequent of all three rules R2, R3 and R4 have been found in the incoming log sequence Si. Even if the consequents had not occurred, all three rules would still have been triggered to raise predictions. These three rules are generated from a single rare sequential pattern and cover the same sequence of events, yet they raise three separate predictions. This method is not effective because it triggers many predictions where a single prediction is sufficient, and it therefore consumes more resources than necessary.

Table 5.2: A view of possible rare sequential association rules.

Rule    Sequential association rule
R1      〈{a}〉 ⇒ 〈{b, d}, {e}, {f}, {d}〉
R2      〈{a}, {b, d}〉 ⇒ 〈{e}, {f}, {d}〉
R3      〈{a}, {b, d}, {e}〉 ⇒ 〈{f}, {d}〉
R4      〈{a}, {b, d}, {e}, {f}〉 ⇒ 〈{d}〉


5.4.2.3 Longest Antecedent Rules

In this method, instead of raising three prediction alerts from three rules, the algorithm raises a single alert from the single rule R4. The other rules are not generated, because R4 has the largest antecedent among the three rules R2, R3 and R4. The antecedent of R2 is 〈{a}, {b, d}〉 and the antecedent of R3 is 〈{a}, {b, d}, {e}〉; both are subsequences of the antecedent 〈{a}, {b, d}, {e}, {f}〉 of R4. Since the antecedents of R2 and R3 are contained in the antecedent of R4, raising separate alerts from R2 and R3 is unnecessary. The longest antecedent method therefore raises a single prediction alert, compared to the 3 prediction alerts raised by the variable length method.
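The difference in alert counts between the two methods can be illustrated with the rules R2, R3 and R4 of Table 5.2 and the sequence Si of Section 5.4.2.1. In the thesis the shorter rules are simply never generated; filtering at match time, as in the sketch below, is only an illustration of the same effect. The function names `triggered_rules` and `longest_triggered_rule` are assumptions introduced here.

```python
def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order (itemset-wise subset)."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def triggered_rules(rules, sequence):
    """Variable length method: every rule whose antecedent occurs in the sequence."""
    return [name for name, (ante, _cons) in rules.items()
            if contains(sequence, ante)]


def longest_triggered_rule(rules, sequence):
    """Longest antecedent method: only the triggered rule with the largest antecedent."""
    hits = triggered_rules(rules, sequence)
    return max(hits, key=lambda name: len(rules[name][0]), default=None)


# Rules R2-R4 of Table 5.2 as (antecedent, consequent) pairs.
TABLE_5_2 = {
    "R2": ([{"a"}, {"b", "d"}], [{"e"}, {"f"}, {"d"}]),
    "R3": ([{"a"}, {"b", "d"}, {"e"}], [{"f"}, {"d"}]),
    "R4": ([{"a"}, {"b", "d"}, {"e"}, {"f"}], [{"d"}]),
}
S_i = [{"a"}, {"b", "d"}, {"e"}, {"b"}, {"f"}, {"a"}, {"d"}]

variable_length_alerts = triggered_rules(TABLE_5_2, S_i)
longest_alert = longest_triggered_rule(TABLE_5_2, S_i)
print(variable_length_alerts, longest_alert)
```

Three alerts are raised under the variable length method and one under the longest antecedent method, so a false prediction costs at most one false positive in the latter case.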

The longest antecedent method is more effective than the variable length method for predicting possible anomalies in the streaming logs. This is because, when an anomaly prediction turns out to be false, the longest antecedent method generates fewer false positives, since it generates fewer prediction alerts. The variable length antecedent method generates more prediction alerts and therefore produces more false positives. For example, assume that 〈{a}, {b, d}, {e}, {b}, {f}, {a}, {c}〉 is an incoming streaming log sequence, denoted as Si, generated from the session time window Wi. In the anomaly prediction process, the consequents 〈{e}, {f}, {d}〉, 〈{f}, {d}〉 and 〈{d}〉 of the rules R2, R3 and R4 respectively cannot be found in the same window Wi, although the antecedents of these rules are found. If, in addition, the consequents of these three rules cannot be found in the following windows within the defined time period, the three prediction alerts from these three rules become false. However, since the longest antecedent method uses only the largest antecedent rule R4 to predict anomalies, it produces at most one false prediction, compared to three false predictions by the variable length antecedent method.

In addition, once an association rule is generated, the algorithm finds the support value of the antecedent ante of the association rule ante ⇒ cons. The number of database scans is equal to the number of association rules generated from each rare sequential pattern P. Since, in the variable length antecedent method, a rare sequential pattern P of size-n generates n-1 sequential association rules, finding the support values of the antecedents requires n-1 database scans. The longest antecedent method, however, requires a single database scan, since it generates a single association rule from each sequential pattern P of size-n.

Moreover, although the variable length antecedent method generates association rules with both high and low confidence values, the longest antecedent method tends to produce association rules with high confidence values. Since the longest antecedent method uses the largest antecedent, the support of this antecedent in the database is usually low compared to that of a shorter antecedent. This low antecedent support contributes to a high confidence value for the rule generated by the longest antecedent method. The probability that the consequent occurs in the streaming logs is higher for a high confidence association rule than for a low confidence one, because the higher the confidence value of an association rule, the stronger the correlation between its antecedent and its consequent. For example, if the user-defined minimum confidence value minconf is set to 0.5, then of the four association rules generated by the variable length antecedent method from the rare sequential pattern 〈{a}, {b, d}, {e}, {f}, {d}〉, as shown in Table 5.2, the rules R2, R3 and R4 become valid rules. The confidence of the association rules R2 and R3 is 0.5, which means that in 50% of the cases in which the antecedent of R2 or R3 occurs, the consequent also occurs in the streaming logs. On the other hand, the confidence of the association rule R4 is 1.0, which means that in 100% of the cases in which the antecedent of R4 occurs, its consequent also occurs in the streaming logs.
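The relationship between antecedent length and confidence can be written out as a short derivation. The notation is introduced here for illustration: $X_i$ denotes the prefix of the rare pattern $P$ with $i$ events and $Y_i$ the remaining suffix. The numerical values use supports counted from Table 5.1 as printed, with $\mathrm{Sup}(P) = 1$ for $P = \langle\{a\},\{b,d\},\{e\},\{f\},\{d\}\rangle$, and they agree with the confidences quoted above for R2, R3 and R4.

```latex
% Every sequence that contains the longer prefix also contains the shorter one,
% so support cannot increase as the antecedent grows.
\[
X_i \sqsubseteq X_{i+1}
\;\Longrightarrow\;
\mathrm{Sup}(X_{i+1}) \le \mathrm{Sup}(X_i)
\;\Longrightarrow\;
\mathrm{Conf}(X_{i+1} \Rightarrow Y_{i+1})
  = \frac{\mathrm{Sup}(P)}{\mathrm{Sup}(X_{i+1})}
  \ge \frac{\mathrm{Sup}(P)}{\mathrm{Sup}(X_i)}
  = \mathrm{Conf}(X_i \Rightarrow Y_i).
\]
% Antecedent supports counted from Table 5.1: 4, 2, 2 and 1 for R1 to R4.
\[
\mathrm{Conf}(R1) = \tfrac{1}{4} = 0.25,\quad
\mathrm{Conf}(R2) = \tfrac{1}{2} = 0.5,\quad
\mathrm{Conf}(R3) = \tfrac{1}{2} = 0.5,\quad
\mathrm{Conf}(R4) = \tfrac{1}{1} = 1.0 .
\]
```

With minconf set to 0.5, R1 falls below the threshold while R2, R3 and R4 are valid, and the longest antecedent rule R4 has the highest confidence of the four.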

The longest antecedent method generates only a single association rule, R4, compared to the four association rules of the variable length antecedent method. Rule R4 triggers fewer times than rules R2 and R3 because it has a longer antecedent than the other rules, and the probability of finding a longer antecedent in the log stream is low. So, even if the consequent of R4 does not occur once the antecedent has appeared, the rule produces fewer false predictions and hence fewer false positives. On the other hand, the shorter antecedent rules R2 and R3 trigger many times in the log stream, because an antecedent with fewer events corresponds to a more frequent sequence of events in the log stream. If the consequent events do not occur after the antecedent has occurred, the triggered rule becomes a false prediction. Hence, more triggers produce more false positives. Since the longest antecedent method provides the system operators with fewer prediction alerts and fewer false positives than the variable length antecedent method, the research in this chapter uses the longest antecedent method of anomaly prediction.
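The difference between the two methods can be sketched as follows. The sketch assumes that a rule is formed by splitting a pattern into a prefix (antecedent) and the remaining suffix (consequent); this shape is consistent with the four rules reported for the five-itemset pattern above, but Table 5.2 is not reproduced here, so the exact rule shape and the function names are assumptions.

```python
def variable_length_rules(pattern):
    """Split the pattern at every position: first k itemsets => remaining itemsets."""
    if len(pattern) < 2:
        raise ValueError("a rule needs at least two itemsets")
    return [(pattern[:k], pattern[k:]) for k in range(1, len(pattern))]


def longest_antecedent_rule(pattern):
    """Keep only the split with the longest antecedent: all but the last itemset => last itemset."""
    if len(pattern) < 2:
        raise ValueError("a rule needs at least two itemsets")
    return (pattern[:-1], pattern[-1:])


# The rare sequential pattern <{a}, {b, d}, {e}, {f}, {d}> from the example.
pattern = [frozenset({"a"}), frozenset({"b", "d"}), frozenset({"e"}),
           frozenset({"f"}), frozenset({"d"})]

print(len(variable_length_rules(pattern)))  # 4
antecedent, consequent = longest_antecedent_rule(pattern)
print(len(antecedent), len(consequent))     # 4 1
```

Under this assumption the longest antecedent rule is simply the last of the variable length rules, which is why it never raises more than one prediction per pattern.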

5.5 Experimental Evaluation

The experiment conducted in this chapter aims to predict and detect possible anomalies from SCADA streaming logs. Anomalies can also be detected from off-line SCADA control logs, as discussed in Chapter 3 and Chapter 4 of this thesis. In Chapter 3, we proposed a new approach for finding rare sequential patterns that can be used to detect anomalies in SCADA control logs. In this chapter, we use sequential association rules generated from rare sequential patterns to predict possible anomalies in the SCADA logs. We used SCADA control system logs from three physical devices, namely the conveyor belt, the pressure control and the water tank, to conduct this experiment. The logs of each control device are used to create two sets of data: the training dataset and the testing dataset. The association rules are generated from the training dataset, while the testing dataset is used to represent the streaming logs. Possible anomalies are then predicted by applying the association rules to the streaming logs. The rest of this section is organised as follows: Section 5.5.1 describes the dataset used for the experiment, and Section 5.5.2 presents the pre-processing required to prepare the training dataset and the testing dataset that are used as inputs to our proposed anomaly prediction algorithms. Finally, Section 5.5.3 describes the experimental methodology.

5.5.1 Dataset

In this experiment, we used SCADA process control logs that were collected from our scaled SCADA industry test-bed laboratory system. In this chapter we use the same datasets that were used in Chapter 4 for the constraint-based rare sequential pattern mining experiment. Recall that the logs for the experiment of Chapter 4 were generated and collected over 8 hours of operation on the three SCADA control systems: the conveyor belt, the pressure system, and the water tank control system. The logs were collected using the on-change method, in which a process control event is recorded when its value has changed during the polling time. This means that the on-change method did not record every value that changed; rather, the values were recorded in the logs at certain time intervals.

Some attacks were conducted to disrupt the process control activities during the 8 hours of operation on the three control systems. For example, in the pressure control system, an attack was conducted by changing the pressure threshold values: the lower pressure threshold value was changed from the set value of 20 PSI (pounds per square inch) to 25 PSI, and the upper threshold value was changed from the set value of 40 PSI to 45 PSI. In the conveyor belt control system, an attack was conducted by unexpectedly changing the direction of the diverter paddle. The detailed descriptions of these datasets are given in Chapter 4 in Table 4.2, Table 4.3 and Table 4.4 respectively.

5.5.2 Pre-processing

In the pre-processing phase, we divided the logs of each control system into two sets. The first set contains the first 40% of each control system's logs. This 40% is pre-processed to prepare a sequential database, which serves as the training dataset. The remaining 60% of each control system's logs is used to generate the respective testing dataset. The purpose of the training dataset is to generate association rules that can be used to predict anomalous events in the testing dataset. The training dataset is composed of a set of sequences that are generated from the control logs, whereas the testing dataset is composed of continuous events from the control logs.
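A minimal sketch of such a chronological split is given below. It assumes that the log events are already held in a time-ordered list; the function name chronological_split and the train_percent parameter are hypothetical. The key point is that the split is positional and is never shuffled, so every training event precedes every testing event.

```python
def chronological_split(events, train_percent=40):
    """Split time-ordered log events into (training, testing) without shuffling."""
    cut = len(events) * train_percent // 100  # integer arithmetic avoids float rounding
    return events[:cut], events[cut:]


# Hypothetical stand-in for ten time-ordered log events.
events = list(range(10))
training, testing = chronological_split(events)
print(training)  # [0, 1, 2, 3]
print(testing)   # [4, 5, 6, 7, 8, 9]
```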

In the process of preparing the training dataset, we select the required features with the help of domain expert knowledge. While selecting the features, we choose those features that can be changed to conduct attacks on the control system. For example, the conveyor belt feature Conv_Run_Status holds the value that indicates the current status of the conveyor belt control system, that is, whether the conveyor belt is running or stopped. If the value of the feature Conv_Run_Status is 0, the conveyor belt is off. On the other hand, if the value of this feature is −1, the conveyor belt is on. The feature is merged with its corresponding stored value to represent an event of the control system. Since this feature holds either 0 or −1, two events {Conv_Run_Status_0} and {Conv_Run_Status_-1} can be generated from the feature Conv_Run_Status. These events form the sequences that make up the training dataset.
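The merging of a feature with its stored value can be sketched as follows. The dictionary-based record layout and the function names to_event and encode_record are assumptions made for illustration; only the feature names and the resulting event names come from the text.

```python
def to_event(feature, value):
    """Merge a feature name with its stored value into a single event name."""
    return f"{feature}_{value}"


def encode_record(record, selected_features):
    """Turn one log record (feature -> value) into the set of events for the selected features."""
    return frozenset(to_event(f, record[f]) for f in selected_features if f in record)


print(to_event("Conv_Run_Status", 0))   # Conv_Run_Status_0
print(to_event("Conv_Run_Status", -1))  # Conv_Run_Status_-1

# Hypothetical log record; features that were not selected are ignored.
record = {"Conv_Run_Status": -1, "HMI_Conv_Direction": 0, "Unselected_Feature": 5}
print(sorted(encode_record(record, ["Conv_Run_Status", "HMI_Conv_Direction"])))
```

Applying the same selected_features list to both the training and the testing logs keeps the event vocabulary identical across the two datasets.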

On the other hand, in the process of generating the testing dataset, we apply the same feature selection process as for the training dataset, and we select the same features that were selected for preparing the training dataset. The reason is that the association rules generated from the training dataset are checked against the testing dataset. If the training dataset and the testing dataset do not have the same features, the association rules from the training dataset may not be found in the testing dataset. Therefore, it is necessary to have the same set of features in both the training and the testing dataset. For example, if three features such as Conv_Run_Status, HMI_Conv_Direction, and Conv_Read_Conv_HMI_Direction are selected from the conveyor belt control logs for the training dataset, the same set of features must be selected for the testing dataset.

5.5.3 Experimental Methodology

In the experiments we used two datasets: the training dataset and the testing dataset. The training dataset is used to generate the rare sequential patterns, which are later used to generate sequential association rules. The testing dataset represents the streaming logs, in which possible anomalies are predicted by using the association rules; it can represent streaming events because it is composed of continuous events from the conveyor belt control logs. Once the streaming events are recorded, they are segmented into fixed time session windows. The events in each window are formed into a sequence so that the rules generated from the conveyor belt training dataset can be checked against the sequence. If the antecedent sequence of an association rule is found in a sequence generated from a window of streaming events, an anomaly prediction is made that the consequent of the same association rule is likely to occur next in the streaming events. The consequent is first checked in the same window where the antecedent is found. If the consequent is not found in the same window, it is checked in the following windows. If the consequent is found either in the same window as the antecedent or in the following windows that lie within the time-span period, the anomaly prediction becomes true. However, if the consequent is found neither in the same window nor in the following windows, the anomaly prediction is removed from the system.
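The windowing and the antecedent check described above can be sketched as follows. This is a simplified illustration under stated assumptions: events are (timestamp in seconds, event name) pairs, each itemset of an antecedent is reduced to a single event name, and the function names are hypothetical.

```python
from collections import defaultdict


def segment_into_windows(events, window_seconds=1):
    """Group time-ordered (timestamp, event_name) pairs into fixed time windows.

    Returns a list of (window_index, [event names in arrival order]).
    """
    windows = defaultdict(list)
    for timestamp, name in events:
        windows[int(timestamp // window_seconds)].append(name)
    return [(index, windows[index]) for index in sorted(windows)]


def contains_in_order(window_events, antecedent):
    """True if the antecedent events occur in the window in the same order (gaps allowed)."""
    remaining = iter(window_events)
    return all(event in remaining for event in antecedent)


# Hypothetical stream: the antecedent of the first conveyor belt rule in one window.
stream = [(0.1, "HMI_Conv_Reset_-1"), (0.5, "Conv_Read_Conv_HMI_Direction_-1"),
          (1.2, "HMI_Conv_Direction_-1")]
antecedent = ["HMI_Conv_Reset_-1", "Conv_Read_Conv_HMI_Direction_-1"]
for index, window in segment_into_windows(stream):
    print(index, contains_in_order(window, antecedent))
```

The membership test on an iterator consumes it up to the matched event, so each antecedent event must be found after the previous one.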

Firstly, we generate rare sequential patterns from the training datasets. To generate the rare sequential patterns, we applied our proposed rare sequential pattern mining algorithms, Algorithm 3.1 and Algorithm 3.2, that were presented in Chapter 3. We set the maximum support threshold value maxsup to 2 to find the rare sequential patterns from each of the three control system training datasets. Once the rare sequential patterns were generated, we used our proposed algorithm, Algorithm 5.1, to generate sequential association rules. The association rules indicate how the antecedent and the consequent of an association rule are correlated with each other. The valid association rules are then selected from all the association rules depending on the user-defined confidence value. In this experiment, we set the minimum confidence value to 0.9, which signifies that there exists a strong correlation between the antecedent and the consequent of an association rule. So, if the antecedent of an association rule occurs in the streaming logs, it is highly likely that the consequent of the same association rule will occur next in the streaming logs. These rules are then used to predict possible anomalies from the testing dataset, which represents the streaming logs. In the anomaly prediction process, we used the proposed algorithm, Algorithm 5.2, to predict possible anomalies in the streaming logs.
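To make the role of maxsup concrete, the following sketch counts the support of a candidate pattern in a small sequential database and tests it against the threshold. It is not Algorithm 3.1 or Algorithm 3.2, which are not reproduced in this section; it only illustrates the rarity test. The interpretation that a pattern is rare when its support is non-zero and at most maxsup is an assumption, and the database contents and function names are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if each pattern itemset is contained in a later itemset of the sequence, in order."""
    remaining = iter(sequence)
    return all(any(itemset <= candidate for candidate in remaining) for itemset in pattern)


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, sequence) for sequence in database)


def is_rare(pattern, database, maxsup=2):
    """Assumed rarity test: the pattern occurs, but in at most maxsup sequences."""
    return 0 < support(pattern, database) <= maxsup


# Hypothetical sequential database of four sequences.
database = [
    [{"a"}, {"b", "d"}, {"e"}],
    [{"a"}, {"e"}],
    [{"b"}, {"a"}],
    [{"a"}, {"c"}, {"e"}],
]
print(support([{"a"}, {"e"}], database), is_rare([{"a"}, {"e"}], database))  # 3 False
print(support([{"a"}, {"b"}], database), is_rare([{"a"}, {"b"}], database))  # 1 True
```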

We conducted the anomaly prediction experiment of this chapter to see whether the sequential association rules can be used to predict possible anomalies from SCADA streaming logs. We also wanted to see whether the predicted anomalous events are part of actual anomalous patterns. We conducted three experiments, Experiment-1, Experiment-2 and Experiment-3, with the training and the testing datasets from the three control system logs. Experiment-1 was conducted with the conveyor belt control logs, Experiment-2 with the pressure system control logs, and Experiment-3 with the water tank control logs.

In all three experiments, we generated sequential association rules from the respective training datasets. While generating the association rules, we generated the longest antecedent rule from each rare sequential pattern that has at least two events. Once the sequential association rules are generated, we keep as valid the association rules whose confidence meets the user-defined minimum confidence value. The antecedents of the sequential association rules are then searched for in the streaming logs of the testing dataset. If the antecedent of a sequential association rule is found in the streaming events, the algorithm raises a prediction that the consequent of the rule will occur next in the streaming events. Here, a streaming event means each individual event that occurs in the control system logs. The streaming events in the testing dataset are segmented into fixed time session windows. We use a window of one second, because some events occur every second in the SCADA control system. Therefore, if an antecedent is found in a window, the consequent may be found in the same window after the antecedent, or in the next few windows. The number of windows in which the consequent may appear depends on the time duration for which the sequential association rule remains valid. So, if the consequent is found in the same window or in the following windows that are within the rule's validity period, the prediction becomes true. It means that the entire association rule (the antecedent followed by the consequent) is found in the streaming logs. The time duration is the time-span period over which a sequence of events occurs; this sequence of events makes up one sequence in the training dataset. If the consequent of an association rule is not found within the time-span duration, the prediction is removed from the system, because the consequent did not occur within the association rule's validity period.
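The life cycle of a prediction described above (raised, then either confirmed or removed) can be sketched as follows. This is a simplified stand-in for Algorithm 5.2, which is not reproduced here. It assumes that the windows list holds one entry per consecutive window, that each itemset is reduced to a single event name, and that the validity period is expressed as a number of following windows (the span parameter); all names are hypothetical.

```python
def match_end(events, pattern, start=0):
    """Return the index just past the last matched event if pattern occurs in order, else None."""
    position = start
    for wanted in pattern:
        try:
            position = events.index(wanted, position) + 1
        except ValueError:
            return None
    return position


def evaluate_rules(windows, rules, span):
    """Count predictions that are confirmed or expire within `span` following windows."""
    confirmed = expired = 0
    for i, window in enumerate(windows):
        for antecedent, consequent in rules:
            end = match_end(window, antecedent)
            if end is None:
                continue  # antecedent not in this window: no prediction is raised
            # Look for the consequent after the antecedent in the same window ...
            found = match_end(window, consequent, end) is not None
            # ... or anywhere in the next `span` windows.
            for later in windows[i + 1 : i + 1 + span]:
                if found:
                    break
                found = match_end(later, consequent) is not None
            if found:
                confirmed += 1
            else:
                expired += 1
    return confirmed, expired


rule = (["HMI_Conv_Reset_-1", "Conv_Read_Conv_HMI_Direction_-1"], ["HMI_Conv_Direction_-1"])
windows = [
    ["HMI_Conv_Reset_-1", "Conv_Read_Conv_HMI_Direction_-1"],
    ["Conv_Run_Status_-1"],
    ["HMI_Conv_Direction_-1"],
]
print(evaluate_rules(windows, [rule], span=2))  # (1, 0)
print(evaluate_rules(windows, [rule], span=1))  # (0, 1)
```

With a span of two windows the consequent arrives in time and the prediction is confirmed; with a span of one window the same prediction expires and would be counted as a false positive.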

5.6 Results and Analysis

In this section we present the experimental results and analysis. We applied our proposed sequential association rule mining algorithm, followed by anomaly prediction, to the three control system datasets. Firstly, we present and illustrate the results obtained from the conveyor belt control system in Section 5.6.1. Secondly, in Section 5.6.2, we discuss the results from the pressure control system. Finally, in Section 5.6.3, we present and analyse the results from the water tank control system.



5.6.1 Conveyor-belt Control System

The rare sequential pattern mining algorithm generated 14 rare sequential patterns from the conveyor belt training dataset. From these rare sequential patterns, Algorithm 5.1 generated 11 sequential association rules. Two of the 11 sequential association rules are shown in Table 5.3. Using these association rules, the anomaly prediction algorithm, Algorithm 5.2, generated 45 possible anomaly predictions. For example, the anomaly predicted using the association rule 〈{HMI_Conv_Reset_-1}, {Conv_Read_Conv_HMI_Direction_-1}〉 ⇒ 〈{HMI_Conv_Direction_-1}〉 was found in the streaming logs, so the prediction was successful. This predicted anomalous pattern was identified as an attack pattern. This is because the conveyor belt control system was reset, which changed the value of HMI_Conv_Reset from 0 to −1 (event {HMI_Conv_Reset_-1}), and then the conveyor belt direction was changed, which changed the value of Conv_Read_Conv_HMI_Direction from 0 to −1 (event {Conv_Read_Conv_HMI_Direction_-1}). These changes caused the conveyor belt direction to change, as reflected in the event {HMI_Conv_Direction_-1}, whose value also changed from 0 to −1. This attack caused objects on the conveyor belt to be sorted in the wrong direction.

Table 5.3: Examples of sequential association rules from three control systems.

Dataset                    Sequential association rules
Conveyor belt dataset      〈{HMI_Conv_Reset_-1}, {Conv_Read_Conv_HMI_Direction_-1}〉 ⇒ 〈{HMI_Conv_Direction_-1}〉
                           〈{Conv_Run_Status_-1}, {HMI_Conv_Reset_-1}〉 ⇒ 〈{Conv_Run_Status_0}〉
Pressure control dataset   〈{Pipe_Pump_Run_Status_-1}, {HMI_Pipe_Solenoid_On_SP_41}〉 ⇒ 〈{HMI_Pipe_Solenoid_On_SP_42}〉
                           〈{Pipe_Pump_Run_Status_-1}, {HMI_Pipe_Solenoid_Off_SP_24}〉 ⇒ 〈{HMI_Pipe_Solenoid_Off_SP_25}〉
Water tank dataset         〈{Tank_Level_65}, {Tank_Level_68}〉 ⇒ 〈{Tank_Read_Tank_Level_68}〉
                           〈{Tank_Usage_Level_61}, {Tank_Read_Tank_Level_42}〉 ⇒ 〈{HMI_Tank_Master_Mode_-1}〉

Similarly, the association rule 〈{Conv_Run_Status_-1}, {HMI_Conv_Reset_-1}〉 ⇒ 〈{Conv_Run_Status_0}〉 was successfully found in the streaming logs. This pattern was also identified as an attack pattern, as the conveyor belt was stopped unexpectedly while the system was running. The event {Conv_Run_Status_-1} shows that the conveyor belt was in the running state. The conveyor belt was then reset, as shown by the event {HMI_Conv_Reset_-1}. After the reset, the conveyor belt stopped, as shown by the event {Conv_Run_Status_0}, which indicates that the conveyor belt was no longer running because the value changed from −1 to 0.

Among the 45 possible anomaly predictions, 32 predictions were found to be true positives. This means that once the antecedent of an association rule had occurred in the streaming logs, the corresponding consequent of the same association rule also occurred in the streaming logs, either in the same window as the antecedent or in one of the following windows. Among the 32 true predictions, 3 predictions were identified as attack patterns, out of 5 actual attacks on the conveyor belt control system. The anomaly prediction true positive rate for the conveyor belt experiment is 71.11%, as shown in Table 5.4.

Table 5.4: Anomaly predictions from the three control system streaming logs.

Control system   Total pred.   True positive   False positive   True positive rate   False positive rate
Conveyor belt    45            32              13               71.11%               28.89%

On the other hand, 13 possible anomaly predictions were found to be false positives, as the consequent of the association rule occurred neither in the window where the antecedent occurred nor in the following windows. So, the anomaly prediction false positive rate is 28.89%.

5.6.2 Pressure Control System

In the pressure control system training dataset, the rare sequential pattern mining algorithm generated 8 rare sequential patterns. From these rare sequential patterns, 6 association rules were generated. The anomaly prediction algorithm, Algorithm 5.2, generated 26 anomaly predictions from the pressure control streaming logs. For example, the anomaly prediction using the association rule 〈{Pipe_Pump_Run_Status_-1}, {HMI_Pipe_Solenoid_On_SP_41}〉 ⇒ 〈{HMI_Pipe_Solenoid_On_SP_42}〉, which is shown in Table 5.3, was successful in the pressure control system streaming logs. This means that after the antecedent 〈{Pipe_Pump_Run_Status_-1}, {HMI_Pipe_Solenoid_On_SP_41}〉 was found in the streaming log, the consequent 〈{HMI_Pipe_Solenoid_On_SP_42}〉 was also found in the streaming log. This possible anomalous pattern was identified as an attack pattern when verified against the attack dataset. It is an attack pattern because the upper pressure threshold value in the event {HMI_Pipe_Solenoid_On_SP_41} is above the set threshold value.


Table 5.5

Control system     Total pred.   True positive   False positive   True positive rate   False positive rate
Pressure control   26            18              8                69.23%               30.77%

In the attack process, the upper threshold value was changed from 40 to 45. In the predicted anomalous pattern, the pressure control system was in the running state, as indicated by the event {Pipe_Pump_Run_Status_-1}. The next event, {HMI_Pipe_Solenoid_On_SP_41}, indicates the current pressure status on the control system, which is 41. As this pressure value was above the threshold value, it was predicted that an attack was progressing on the control system and that the pressure value was increasing. This was found to be true, as indicated by the next event {HMI_Pipe_Solenoid_On_SP_42}, which was found in the streaming log. Among the 26 possible anomaly predictions, 18 predictions were found to be true positives, because the antecedent of the association rule was followed by the consequent of the same rule in the streaming logs. Among the 18 true predictions, 4 predictions were identified as attack patterns, out of 6 actual attacks on the pressure control system. As a result, the anomaly prediction algorithm's true positive rate is 69.23%, as shown in Table 5.5. On the other hand, 8 anomaly predictions were found to be false positives. Hence, the false positive prediction rate is 30.77%.

5.6.3 Water Tank Control System

The rare sequential pattern mining algorithm generated 52 rare sequential patterns from the water tank training dataset. The association rule mining algorithm, Algorithm 5.1, then generated 36 sequential association rules from these rare sequential patterns. Some examples of the sequential association rules are given in Table 5.3. Using these association rules, Algorithm 5.2 generated 78 possible anomaly predictions on the water tank streaming logs. Among the anomaly predictions, 49 predictions were successful because the consequent of the association rule occurred in the streaming logs once the antecedent of the rule had occurred. For example, the anomaly prediction using the association rule 〈{Tank_Level_65}, {Tank_Level_68}〉 ⇒ 〈{Tank_Read_Tank_Level_68}〉, as shown in Table 5.3, was found in the streaming log. Since 49 predictions were found to be true, the prediction true positive rate for the water tank control system is 62.82%, as shown in Table 5.6.


Table 5.6

Control system   Total pred.   True positive   False positive   True positive rate   False positive rate
Water tank       78            49              29               62.82%               37.18%

Among the successful predictions, none were identified as attack patterns. This is because the attacks on the water tank control system were flooding attacks. On the other hand, 29 predictions were found to be false, as the consequent of the association rule was not found in the streaming logs after the antecedent of the rule was found. For example, the possible anomaly prediction using the association rule 〈{Tank_Usage_Level_61}, {Tank_Read_Tank_Level_42}〉 ⇒ 〈{HMI_Tank_Master_Mode_-1}〉 was a false positive, because the consequent did not occur in the streaming logs even though the antecedent was found. The anomaly false positive prediction rate is 37.18%, as shown in Table 5.6.

5.7 Discussion

This section discusses the anomaly prediction method used and the results obtained from the experiments. In this experiment, we have shown that possible anomalies can be predicted on the SCADA control system by using real-time streaming logs. We experimented with three control system logs: the conveyor belt control system, the pressure control system and the water tank control system. With each of the control system logs, we conducted the experiment in three phases. In the first phase, we created two datasets, the training dataset and the testing dataset, from each of the control system logs, and generated rare sequential patterns from the training dataset. In the second phase, the rare sequential patterns were used to generate the sequential association rules with which the anomaly predictions are performed. Finally, in the third phase, we generated possible anomaly predictions in the streaming logs by using the association rules.

The purpose of the association rule mining was to predict possible anomalies in the streaming logs. Every association rule is composed of two parts: the antecedent and the consequent. Although we experimented with the variable length antecedent method of anomaly prediction, the method did not produce the expected prediction results. It was found that this method produces a large number of anomaly predictions, because the variable length antecedent method generates a large number of association rules, with both shorter and longer antecedents. The shorter antecedent rules were triggered many times, because a shorter antecedent is more frequent and is therefore found many times in the streaming logs.

Although the shorter antecedent rules were triggered many times, the consequents of the rules often did not occur in the streaming logs. This is because the shorter antecedent rules have a weak correlation between the antecedent and the consequent, which is reflected in the low confidence value of these association rules. The confidence of an association rule is derived by dividing the frequency of the rare sequential pattern by the frequency of the antecedent. Rare sequential patterns generally have a low frequency, whereas a shorter antecedent tends to have a high frequency. Hence, dividing the low frequency value by the high frequency value produces a low confidence.
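The confidence computation described in this paragraph can be written compactly. The notation below is introduced here for illustration and may differ from the symbols used elsewhere in the thesis: X is the antecedent, Y is the consequent, X · Y is the rare sequential pattern formed by X followed by Y, and sup(·) is the frequency of a sequence in the training dataset.

```latex
\[
\mathrm{conf}(X \Rightarrow Y) \;=\; \frac{\mathrm{sup}(X \cdot Y)}{\mathrm{sup}(X)},
\qquad
0 < \mathrm{conf}(X \Rightarrow Y) \le 1 .
\]
```

The upper bound holds because every sequence that contains X · Y also contains X, so sup(X) is at least sup(X · Y). As a purely hypothetical illustration, a rare pattern with sup(X · Y) = 2 gives a confidence of 2/20 = 0.1 when a short antecedent occurs 20 times, but a confidence of 2/2 = 1.0 when a long antecedent occurs only twice.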

The weak correlation of an association rule means that the probability of the consequent occurring in the streaming logs is low once the antecedent of the rule is found. Therefore, even if the antecedent of a rule was found many times in the streaming logs, the consequent did not occur next, which resulted in unsuccessful predictions. Since the variable length antecedent method generated a large number of predictions, the anomaly prediction success rate was lower than the unsuccessful prediction rate. Moreover, the variable length antecedent method generated redundant association rules from a rare sequential pattern. For a single anomalous pattern, this method raised multiple predictions by triggering redundant association rules. Triggering redundant association rules not only contributed to the large number of anomaly predictions but also consumed a large amount of computational time.

To reduce the number of anomaly predictions and remove the problem of redundant association rules, we applied the longest antecedent method of generating association rules. In the longest antecedent method, we only generate the association rule that has the longest antecedent, which reduces the number of association rules. The longest antecedent association rules reduce the possible anomaly predictions by not generating multiple predictions for a single anomalous pattern. As a result, the longest antecedent method generated fewer anomaly predictions than the variable length antecedent method.

The longest antecedent method also reduces the number of false positives. Once an alert is raised for a possible anomaly prediction using the longest antecedent of an association rule, the consequent of the same rule is checked in the streaming logs. If the consequent is found in the streaming logs, the anomaly prediction is successful, that is, a true positive. However, if the consequent is not found, the prediction is a false positive. Since the prediction method uses only the longest antecedent rule, it produces fewer false positives. On the other hand, since the variable length antecedent method uses antecedent rules of all possible lengths, it generates more false positives when the anomaly prediction is found unsuccessful.

In the anomaly prediction experiments, we found different successful prediction rates. In the conveyor belt experiment, the algorithm raised 45 anomaly predictions. Among these predictions, 32 were successful, while 13 were unsuccessful. The successful prediction rate is 71.11% against an unsuccessful prediction rate of 28.89%. In the pressure control system experiment, the algorithm raised 26 anomaly predictions. Among these predictions, 18 were successful, while 8 were unsuccessful. The prediction success rate for the pressure control experiment is 69.23%, while the false prediction rate is 30.77%. Finally, in the water tank experiment, among 78 anomaly predictions, 49 were successful against 29 unsuccessful predictions. The prediction success rate for the water tank experiment is 62.82%, while the false prediction rate is 37.18%.
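The reported rates follow directly from the prediction counts, as the short sketch below shows. The counts are taken from Tables 5.4 to 5.6; the function name prediction_rates is an assumption. Note that both rates are computed over the total number of predictions raised, so the "true positive rate" used here corresponds to the fraction of raised predictions that were confirmed.

```python
# (total predictions, true positives) as reported in Tables 5.4 to 5.6.
results = {
    "Conveyor belt": (45, 32),
    "Pressure control": (26, 18),
    "Water tank": (78, 49),
}


def prediction_rates(total, true_positive):
    """Return (true positive rate, false positive rate) in percent, over all raised predictions."""
    false_positive = total - true_positive
    return (round(100 * true_positive / total, 2), round(100 * false_positive / total, 2))


for system, (total, true_positive) in results.items():
    print(system, prediction_rates(total, true_positive))
```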

In all three control system experiments, the algorithm's anomaly prediction success rate is higher than the unsuccessful prediction rate. However, the false positive rate still remains high. This is because a possible anomaly prediction is raised whenever the antecedent of an association rule is found in a streaming log window. The antecedent of a rule can be frequent, even though the association rule was generated from a rare sequential pattern. Therefore, a frequent antecedent appears many times in the streaming log windows. However, the consequent does not appear as frequently after it, since the association rule comprising both antecedent and consequent is rare. There are two possibilities: either the consequent never happened, or the consequent occurred but not within the defined time window. In both cases, the possible anomaly prediction is found to be false, which increases the false positive rate.

The possible anomaly predictions can be further reduced by generating the association rules only from the rare sequential patterns that have been identified as attack patterns, instead of using all rare sequential patterns. However, this method would not be able to predict any new or zero-day anomalies on the system.

In the experiment, it was found that the method was able to identify some possible anomaly predictions as attack patterns. This is because the attack pattern was found both in the training and in the testing dataset. On the other hand, a few attacks were not identified by the possible anomaly predictions. This is because some of the attacks conducted on the control system were found in only one of the two datasets, either the training dataset or the testing dataset. Therefore, it is required that the attack patterns appear both in the training and the testing dataset, since the possible anomalous patterns are predicted from the testing dataset. In other words, to verify whether a predicted anomalous pattern is an attack pattern, the training dataset needs to have some attack patterns which are also found in the testing dataset.

It is noted that in the anomaly prediction experiment, we cannot swap the training dataset and the testing dataset. Swapping would mean that, firstly, the association rules are generated from the training dataset and used to predict possible anomalies in the testing dataset; and secondly, the training and the testing datasets are interchanged so that association rules are generated from the newly interchanged training dataset and used to predict possible anomalies in the newly interchanged testing dataset. Swapping the training and testing datasets is not possible because, in the SCADA control system, the logs are recorded sequentially, that is, the events are recorded in chronological order. The training dataset is created from the log events that occurred before the log events from which the testing dataset was created. The purpose of the training dataset is to generate sequential association rules which are later applied to the testing dataset: if the antecedent of an association rule is found in the testing dataset, it is predicted that the corresponding consequent is likely to occur next in the testing dataset. However, if the training dataset is swapped with the testing dataset, the sequential order of the events is lost, because the association rules generated from the interchanged training dataset are composed of log events which occur after the log events in the interchanged testing dataset. Therefore, swapping the training dataset and the testing dataset is not possible in a sequential database.

The prediction results could be more accurate if the training and testing data were split in different proportions. However, splitting the data in different proportions is not feasible for this experiment. This is because each of the attacks conducted on the control system while generating the data was found either in the training dataset or in the testing dataset. If the percentage of the training dataset is increased, no attack data will remain in the testing dataset, and hence it cannot be verified whether the predicted anomalies are true attack patterns. As noted above, verifying whether a predicted anomalous pattern is an attack pattern requires attack patterns that appear in both the training and the testing dataset, since the possible anomalous patterns are predicted from the testing dataset.

5.8 Summary

This chapter has presented a new sequential association rule mining approach for possible anomaly prediction. This method has been used to predict and detect possible anomalies on the SCADA control system, using the SCADA control system streaming logs. The experimental results showed that possible anomalies can be predicted by using sequential association rules. The system is alerted to the possibility of an anomalous pattern occurring in the streaming logs. The anomaly prediction method produced fewer false positives because it raises alerts for possible anomalies using only the longest antecedent association rule.

The next chapter, Chapter 6, presents the summary of this thesis. It reviews the contributions and how they have been achieved. Finally, Chapter 6 concludes with some recommendations for future work.

Chapter 6

Conclusion and Future Work

This chapter presents the overall research summary of this thesis. In this summary we relate the objectives outlined in Chapter 1 to how we have achieved the goals of this thesis. This chapter also presents future research directions to address the problems that were identified while conducting this research.

6.1 Research Summary

The detection of anomalies in a SCADA control system can be performed by analysing either control logs or network traffic packets. In this thesis, we have analysed SCADA control logs to find anomalies. Since SCADA control process activities are recorded sequentially, with a timestamp attached to each event, this thesis used a sequential pattern mining approach to analyse the logs. The process activities of a SCADA control system are limited and repetitive. As a result, the normal activities of the SCADA control system are predictable and produce a regular behavioural profile of the system. This regular profile is the frequent behaviour of the system. However, any abnormal process activity that deviates from the normal activities of the control system represents a rare activity. These rare activities are the anomalies of the control system. These anomalies could be deliberate cyber-attacks or system malfunctions.


Although anomalies could be frequent or rare in a system, we assumed that anomalies are rare events in a SCADA control system. We therefore used a rare sequential pattern mining method to find rare events in the control system. As far as we are aware, there has been no prior work that detects anomalies using rare sequential patterns by analysing SCADA control logs. Hadžiosmanović et al. [17] used water treatment SCADA control logs to find threats in the system, but they used rare itemset pattern mining to find rare events. Because itemset mining ignores the order of events, their approach could not find rare sequences of events representing anomalies, even though ordered events are important in identifying anomalies in a SCADA control system. We have addressed this problem in this thesis using the proposed rare sequential pattern mining method. In this thesis, we have achieved the three key objectives set for our research, which aimed to detect and predict anomalies in SCADA control system logs by using rare sequential pattern mining. We conclude this thesis by presenting a summary of how we have achieved the objectives of our research stated in Chapter 1.

Objective 1: To design and develop a method for finding anomalies that are rare in SCADA control systems. This research objective is achieved with the use of a rare sequential pattern mining approach. The details of this approach are discussed in Chapter 3 of this thesis. In that chapter, we proposed and developed a new method for finding rare sequential patterns in SCADA system logs. Using this method we analysed the SCADA control logs and discovered rare sequential patterns that represent anomalous behaviour of the control system. We also found the minimal (smallest) and the maximal (longest) rare sequential patterns in an equivalence class, where the patterns share the same frequency and occur in the same number of sequences. The purpose of finding minimal and maximal rare anomalous patterns was to evaluate which rare patterns give greater understanding when detecting anomalies. It was found that, in the SCADA domain, the maximal rare sequential pattern gives more context for understanding the complete scenario of an anomalous pattern than the minimal rare sequential patterns, which indicate only the starting point of the anomalous pattern. We also found that the order of events in SCADA control logs is important for detecting anomalies.
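The role of event order in the support of a sequential pattern can be shown with a minimal sketch. It assumes that a pattern is contained in a sequence when its events appear in the same order, not necessarily contiguously, and that a pattern is rare when it occurs at least once but in fewer than a hypothetical `max_support` number of sequences. The precise definitions and thresholds are those given in Chapter 3; the event names and the threshold below are illustrative only.

```python
def contains(sequence, pattern):
    """True if the events of `pattern` appear in `sequence` in the same order
    (gaps between them are allowed)."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def support(database, pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(contains(sequence, pattern) for sequence in database)


def is_rare(database, pattern, max_support):
    """Assumed rarity test: the pattern exists but stays below max_support."""
    return 0 < support(database, pattern) < max_support


# Hypothetical sequential database: three normal sequences and one
# in which the pump is started before the valve is opened.
database = [
    ["valve_open", "pump_start", "pump_stop"],
    ["valve_open", "pump_start", "pump_stop"],
    ["valve_open", "pump_start", "pump_stop"],
    ["pump_start", "valve_open", "pump_stop"],
]

normal_order = ["valve_open", "pump_start"]
reversed_order = ["pump_start", "valve_open"]
```

Both patterns contain the same two events, so an itemset view cannot tell them apart. Only the ordered view separates the frequent pattern from the rare one.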


Objective 2: To improve the efficiency of the rare sequential pattern mining algorithm without losing accuracy by introducing constraints. We have achieved this objective with the constraint-based rare sequential pattern mining algorithm. The details of this algorithm are presented in Chapter 4. In that chapter, we used three constraints: the time-span gap constraint, the feature reduction constraint, and the algorithmic constraint. The time-span gap constraint is applied in the pre-processing stage, when the SCADA control logs are segmented into sequences to create the sequential database. It is used to find significant rare sequential patterns whose events occur within a defined time period. This constraint is implemented by selecting the episodic events from the control logs. As a result, events do not overlap across consecutive sequences. Secondly, the feature reduction constraint was used to reduce the number of unique events in the sequence database. Fewer unique events lead to fewer possible candidate super sequences. The outcome revealed that this reduction in candidate super sequences improved the efficiency of the constraint-based rare sequential pattern mining algorithm by reducing its computational time.
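One way to picture the pre-processing step is the following sketch, which segments a timestamped log into non-overlapping sequences. It assumes that a new sequence starts whenever the gap between two consecutive events exceeds a threshold. The function name `segment_by_gap`, the `max_gap_seconds` parameter and the event layout are hypothetical, and the actual segmentation rule is the one defined in Chapter 4.

```python
def segment_by_gap(events, max_gap_seconds):
    """Cut a time-ordered log into sequences (episodes).

    events: list of (timestamp_seconds, event_name), sorted by timestamp.
    A new sequence begins when the gap to the previous event is larger
    than max_gap_seconds, so each event belongs to exactly one sequence.
    """
    sequences, current, last_time = [], [], None
    for timestamp, name in events:
        if last_time is not None and timestamp - last_time > max_gap_seconds:
            sequences.append(current)
            current = []
        current.append(name)
        last_time = timestamp
    if current:
        sequences.append(current)
    return sequences


# Hypothetical log: two bursts of activity and one isolated event.
log = [(0, "a"), (5, "b"), (100, "c"), (104, "d"), (300, "e")]
sequential_database = segment_by_gap(log, max_gap_seconds=30)
```

Because every event is assigned to a single episode, no event is shared between consecutive sequences, which matches the non-overlapping property described above.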

The third constraint is the algorithmic constraint, which was implemented within the rare sequential pattern mining algorithm. It consists of two parts. The first checks the size of the candidate super sequence before the candidate is searched for in the database. Since a candidate super sequence cannot be contained in a database sequence that is shorter than itself, scanning of the shorter sequences can be avoided. The second avoids unnecessary database scans: once the candidate super sequence is found in a sequence in the database, it is not searched for in the remaining sequences. The outcome showed that these two algorithmic constraints reduced the computational time of the constraint-based rare sequential pattern mining algorithm. Further, the anomalies detected by the constraint-based algorithm were compared with those detected by the rare sequential pattern mining algorithm without constraints. The result revealed that the constraint-based rare sequential pattern mining algorithm found the same number of anomalies as the unconstrained algorithm.
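The two algorithmic constraints can be sketched as an existence check over the sequential database. This is an illustration under stated assumptions rather than the thesis implementation: the function names are hypothetical, containment is taken to mean ordered (possibly gapped) occurrence, and the `scanned` counter is added only to make the saving visible.

```python
def contains(sequence, pattern):
    """True if the events of `pattern` appear in `sequence` in the same order."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def exists_in_database(database, candidate):
    """Return (found, scanned): whether the candidate occurs in any sequence,
    and how many sequences were actually scanned to decide that."""
    scanned = 0
    for sequence in database:
        # Constraint 1: a shorter sequence cannot contain a longer candidate.
        if len(sequence) < len(candidate):
            continue
        scanned += 1
        # Constraint 2: stop at the first match instead of scanning the rest.
        if contains(sequence, candidate):
            return True, scanned
    return False, scanned


# Hypothetical database of four sequences.
database = [["a"], ["a", "b"], ["a", "c", "b"], ["a", "b", "c"]]
found, scanned = exists_in_database(database, ["a", "c"])
```

In this example the one-event sequence is skipped by the size check and the search stops at the third sequence, so only two of the four sequences are scanned.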


Objective 3: To provide an anomaly prediction method that can extend the work of the rare sequential pattern mining algorithm. We have achieved this objective with a new anomaly prediction method based on the sequential association rules mining technique. We extended our proposed rare sequential pattern mining algorithm to predict anomalies in streaming SCADA control logs. The rare sequential pattern mining algorithm is used to detect anomalies in static SCADA control logs, which we achieved in Objective 1. On its own, that algorithm cannot predict possible anomalies before they occur on the live SCADA system. To achieve this, we developed and implemented an anomaly prediction algorithm that raises an alert once the precursor, or antecedent, of an anomalous pattern is found in the incoming streaming logs. The detailed analysis and implementation of this method are presented in Chapter 5 of this thesis. We have validated the results by using the training and testing datasets of the SCADA control logs. It was found that the anomaly prediction method is effective in predicting anomalies in streaming SCADA control logs before they occur on the system.
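The alerting step can be sketched as follows, assuming that each rule is an (antecedent, consequent) pair of event tuples with a non-empty antecedent, that an antecedent is matched as an ordered subsequence of a bounded window of recent events, and that, when several rules match at the same event, only the one with the longest antecedent raises the alert. The function name `predict`, the `window` parameter and the rules shown are hypothetical. The actual algorithm is described in Chapter 5.

```python
from collections import deque


def contains(sequence, pattern):
    """True if the events of `pattern` appear in `sequence` in the same order."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def predict(stream, rules, window=10):
    """Scan a stream of events and return (position, antecedent, consequent)
    alerts, one per position at most, using the longest matching antecedent."""
    recent = deque(maxlen=window)  # bounded buffer of the latest events
    alerts = []
    for position, event in enumerate(stream):
        recent.append(event)
        # Only rules whose antecedent is completed by the current event.
        matches = [
            rule for rule in rules
            if rule[0][-1] == event and contains(recent, rule[0])
        ]
        if matches:
            antecedent, consequent = max(matches, key=lambda rule: len(rule[0]))
            alerts.append((position, antecedent, consequent))
    return alerts


# Hypothetical rules: antecedent -> predicted (possibly anomalous) consequent.
rules = [
    (("b",), ("y",)),
    (("a", "b"), ("z",)),
]
alerts = predict(["a", "b"], rules)
```

Preferring the longest antecedent means that the alert is based on the most specific context available, which is the property credited in Chapter 5 with reducing false positives.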

6.2 Future Research Directions

This thesis has mainly focused on applying the rare sequential pattern mining algorithm to a SCADA control system to detect and predict anomalies. Since the proposed rare sequential pattern mining method is generic, it can be applied in other domains for finding anomalies. In Chapter 3, an analysis of generating rare sequential patterns to detect anomalies was discussed. In the process of generating rare patterns, all possible candidate sequence patterns are generated from a rare sequential pattern. Most of these candidate sequence patterns turn out not to exist in the database. This is because, while generating sequential candidate patterns, the events are placed in different positions of a sequence. As events occur in a sequential manner, it is highly unlikely that events that happen later in a sequence can appear before other events. These non-existent patterns cost considerable computational time and space. In future research, heuristics can be applied while generating candidate sequential patterns to reduce the number of non-existent patterns so that computational time can be improved.
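The scale of the problem, and one possible direction for such a heuristic, can be illustrated with a small sketch. Candidates are generated by inserting every known event at every position of a pattern, and a candidate is kept only if each pair of adjacent events has been observed in that order somewhere in the database. This pair-order filter is a hypothetical example of a heuristic, not one proposed or evaluated in this thesis. All names and data below are illustrative.

```python
from itertools import combinations


def contains(sequence, pattern):
    """True if the events of `pattern` appear in `sequence` in the same order."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def extensions(pattern, alphabet):
    """All candidates obtained by inserting one event at any position."""
    return {
        pattern[:i] + (event,) + pattern[i:]
        for event in alphabet
        for i in range(len(pattern) + 1)
    }


def observed_pairs(database):
    """Every ordered pair (earlier, later) seen within some sequence."""
    pairs = set()
    for sequence in database:
        pairs.update(combinations(sequence, 2))
    return pairs


def plausible(candidate, pairs):
    """Keep a candidate only if all adjacent pairs were observed in order."""
    return all(pair in pairs for pair in zip(candidate, candidate[1:]))


# Hypothetical database and pattern.
database = [("a", "b", "c"), ("a", "b", "c"), ("a", "c")]
alphabet = {"a", "b", "c"}
candidates = extensions(("a", "c"), alphabet)
existing = {c for c in candidates if any(contains(s, c) for s in database)}
pairs = observed_pairs(database)
kept = {c for c in candidates if plausible(c, pairs)}
```

In this toy example seven candidates are generated but only one exists in the database. The filter never discards an existing candidate, because a candidate that occurs in a sequence has all of its adjacent pairs occurring in order in that same sequence.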


We have used the Apriori method to generate rare sequential generator patterns. The efficiency of the Apriori method is not optimal because it generates candidate generator patterns. In sequential pattern mining, generating candidate sequence patterns is more time consuming than in itemset pattern mining. Hence, our proposed rare sequential pattern mining algorithm is not an efficient method of finding rare sequential patterns. Its efficiency can be further improved by implementing the algorithm using the pattern growth method. Moreover, in this thesis, we have used SCADA control system logs to find anomalies by using our proposed rare sequential pattern mining algorithm. As our proposed method is a generic approach to finding rare sequential patterns, it can be used with standard IT networks to find rare behaviour. To find rare activities in an IT network, packet data can be analysed to detect anomalies in the network.

Finally, our proposed anomaly prediction method was not fully effective, as it produces a high number of false positives. This is because, in the proposed method, the possible anomalous pattern flag is raised once the antecedent of a rule is found in the streaming logs. However, not every antecedent of an association rule is part of an anomalous pattern. Therefore, domain experts can identify which of the rules correspond to genuinely anomalous patterns. The rule set can then be reduced to these rules before it is used for anomaly prediction. As a result, it is possible that the number of false positives can be reduced. In future research, the proposed method can be applied to standard IT networks to predict possible anomalies by analysing streaming packets.

Bibliography

[1] S. Collins and S. McCombie, “Stuxnet: the emergence of a new cyber weapon and its implications,” Journal of Policing, Intelligence and Counter Terrorism, vol. 7, no. 1, pp. 80–91, 2012.

[2] R. M. van der Knijff, “Control systems/SCADA forensics, what's the difference?,” Digital Investigation, vol. 11, no. 3, pp. 160–174, 2014.

[3] A. A. Cardenas, S. Amin, Z.-S. Lin, Y.-L. Huang, C.-Y. Huang, and S. Sastry, “Attacks against process control systems: risk assessment, detection, and response,” in Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security, pp. 355–366, ACM.

[4] “Critical Foundations: Protecting America's Infrastructures. Report of the President's Commission on Critical Infrastructure Protection.” Available from https://fas.org/sgp/library/pccip.pdf. Accessed 11 July 2018.

[5] P. Pederson, D. Dudenhoeffer, S. Hartley, and M. Permann, “Critical infrastructure interdependency modeling: a survey of US and international research,” Idaho National Laboratory, pp. 1–20, 2006.

[6] E. Carter, CCSP Self-study: Cisco Secure Intrusion Detection System (CSIDS). Cisco Press, 2004.

[7] J. Weiss, “Cyber security research and development,” report, KEMA Inc., 2008.

[8] B. Miller and D. Rowe, “A survey SCADA of and critical infrastructure incidents,” in Proceedings of the 1st Annual Conference on Research in Information Technology, pp. 51–56, ACM, 2012.

[9] M. D. Cavelty, Cyber-security and threat politics: US efforts to secure the information age. Routledge, 2007.


[10] J. Guan, J. H. Graham, and J. L. Hieb, “A digraph model for risk identification and management in SCADA systems,” in Intelligence and Security Informatics (ISI), 2011 IEEE International Conference on, pp. 150–155, IEEE, 2011.

[11] K. Wilhoit, “The SCADA that didn't cry wolf,” Trend Micro Inc., White Paper, 2013.

[12] “Year in Review: How Did the Cyberthreat Landscape Change in 2017?” https://securityintelligence.com/year-in-review-how-did-the-cyberthreat-landscape-change-in-2017/, 2017 (accessed April 21, 2018).

[13] “Security attacks on industrial control systems.” Available from https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=SEL03046USEN, 2015 (accessed April 21, 2018).

[14] D. Hadžiosmanović, D. Bolzoni, S. Etalle, and P. Hartel, “Challenges and opportunities in securing industrial control systems,” in Complexity in Engineering (COMPENG), 2012, pp. 1–6, IEEE.

[15] I. Garitano, R. Uribeetxeberria, and U. Zurutuza, “A review of SCADA anomaly detection systems,” in Proceedings of the 6th International Conference on Soft Computing Models in Industrial and Environmental Applications, pp. 357–366, Springer.

[16] D. Hadžiosmanović, D. Bolzoni, and P. Hartel, “Towards securing SCADA systems against process-related threats,” 2010.

[17] D. Hadžiosmanović, D. Bolzoni, and P. H. Hartel, “A log mining approach for process monitoring in SCADA,” Int. J. of Inform. Security, vol. 11, no. 4, pp. 231–251, 2012.

[18] J. Gao, J. Liu, B. Rajan, R. Nori, B. Fu, Y. Xiao, W. Liang, and C. Philip Chen, “SCADA communication and security issues,” Security and Communication Networks, vol. 7, no. 1, pp. 175–194, 2014.

[19] Y. Ebata, H. Hayashi, Y. Hasegawa, S. Komatsu, and K. Suzuki, “Development of the intranet-based SCADA (supervisory control and data acquisition system) for power system,” in Power Engineering Society Winter Meeting, 2000. IEEE, vol. 3, pp. 1656–1661, IEEE.

[20] A. Giani, G. Karsai, T. Roosta, A. Shah, B. Sinopoli, and J. Wiley, “A testbed for secure and robust SCADA systems,” ACM SIGBED Review, vol. 5, no. 2, p. 4, 2008.

[21] M. Cheminod, L. Durante, and A. Valenzano, “Review of Security Issues in Industrial Networks,” IEEE Trans. on Ind. Informat., vol. 9, no. 1, pp. 277–293, 2013.

[22] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Computing Surveys (CSUR), vol. 41, no. 3, p. 15, 2009.

[23] D. E. Denning, “An intrusion-detection model,” IEEE Transactions on Software Engineering, no. 2, pp. 222–232, 1987.

[24] A. Lazarevic, L. Ertöz, V. Kumar, A. Ozgur, and J. Srivastava, “A comparative study of anomaly detection schemes in network intrusion detection,” in SDM, pp. 25–36, SIAM.

[25] J. Verba and M. Milvich, “Idaho National Laboratory supervisory control and data acquisition intrusion detection system (SCADA IDS),” in Technologies for Homeland Security, 2008 IEEE Conference on, pp. 469–473, IEEE.

[26] S. Manganaris, M. Christensen, D. Zerkle, and K. Hermiz, “A data mining analysis of RTID alarms,” Computer Networks, vol. 34, no. 4, pp. 571–577, 2000.

[27] C. Clifton and G. Gengo, “Developing custom intrusion detection filters using data mining,” in IEEE Proc. 21st Century Military Commun., vol. 1, pp. 440–443, 2000.

[28] K. Julisch and M. Dacier, “Mining intrusion detection alarms for actionable knowledge,” in Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 366–375, ACM, 2002.

[29] K. Sequeira and M. Zaki, “ADMIT: anomaly-based data mining for intrusions,” in Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 386–395, ACM.


[30] Y. Fan, Y. Ye, and L. Chen, “Malicious sequential pattern mining for automatic malware detection,” Expert Systems with Applications, vol. 52, pp. 16–25, 2016.

[31] A. Patcha and J.-M. Park, “An overview of anomaly detection techniques: Existing solutions and latest technological trends,” Computer Networks, vol. 51, no. 12, pp. 3448–3470, 2007.

[32] T. Shon and J. Moon, “A hybrid machine learning approach to network anomaly detection,” Information Sciences, vol. 177, no. 18, pp. 3799–3821, 2007.

[33] S. Noel, D. Wijesekera, and C. Youman, Modern intrusion detection, data mining, and degrees of attack guilt, pp. 1–31. Springer, 2002.

[34] A. Wespi, M. Dacier, H. Debar, and M. M. Nassehi, “Audit trail pattern analysis for detecting suspicious process behavior,” in Proceedings of RAID 98, Workshop on Recent Advances in Intrusion Detection, 1998.

[35] K. J. Cios and L. A. Kurgan, Trends in data mining and knowledge discovery, pp. 1–26. Springer, 2005.

[36] W. Lee and S. J. Stolfo, “Data mining approaches for intrusion detection,” in 7th USENIX Security Symposium.

[37] C. Shearer, “The CRISP-DM model: the new blueprint for data mining,” Journal of Data Warehousing, vol. 5, no. 4, pp. 13–22, 2000.

[38] H. Wang, Exploring Intrinsic Structures from Samples: Supervised, Unsupervised, and Semisupervised Frameworks. Thesis, 2007.

[39] C. Kemp, T. L. Griffiths, S. Stromsten, and J. B. Tenenbaum, “Semi-supervised learning with trees,” in Advances in Neural Information Processing Systems, pp. 257–264, 2004.

[40] Z. Xiaojin, “Semi-Supervised Learning Tutorial.” http://pages.cs.wisc.edu/~jerryzhu/pub/sslicml07.pdf, 2007 (accessed April 08, 2018).

[41] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules in large databases,” in Proceedings of International Conference on Very Large Databases (VLDB '94), pp. 487–499, 1994.


[42] J. Pei and J. Han, “Constrained frequent pattern mining: a pattern-growth view,” ACM SIGKDD Explorations Newsletter, vol. 4, no. 1, pp. 31–39, 2002.

[43] S. Brin, R. Motwani, and C. Silverstein, “Beyond market baskets: Generalizing association rules to correlations,” in ACM SIGMOD Record, vol. 26, pp. 265–276, ACM, 1997.

[44] C. Silverstein, S. Brin, R. Motwani, and J. Ullman, “Scalable techniques for mining causal structures,” Data Mining and Knowledge Discovery, vol. 4, no. 2-3, pp. 163–192, 2000.

[45] J. Pei, J. Han, and W. Wang, “Constraint-based sequential pattern mining: the pattern-growth methods,” Journal of Intelligent Information Systems, vol. 28, no. 2, pp. 133–160, 2007.

[46] B. Lent, A. Swami, and J. Widom, “Clustering association rules,” in Data Engineering, 1997. Proceedings. 13th International Conference on, pp. 220–231, IEEE, 1997.

[47] G. Dong and J. Li, “Efficient mining of emerging patterns: Discovering trends and differences,” in Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 43–52, ACM, 1999.

[48] H. Mannila, H. Toivonen, and A. I. Verkamo, “Discovery of frequent episodes in event sequences,” Data Mining and Knowledge Discovery, vol. 1, no. 3, pp. 259–289, 1997.

[49] J. Han, J. Pei, and M. Kamber, Data mining: concepts and techniques. Elsevier, 2011.

[50] R. Agrawal and R. Srikant, “Mining sequential patterns,” in Data Engineering, 1995. Proceedings of the Eleventh International Conference on, pp. 3–14, IEEE, 1995.

[51] C. H. Mooney and J. F. Roddick, “Sequential pattern mining – approaches and algorithms,” ACM Computing Surveys (CSUR), vol. 45, no. 2, p. 19, 2013.


[52] S. M. Vishal, “A survey on sequential pattern mining algorithms,” International Journal of Computer Science and Information Technologies (IJCSIT), vol. 5, no. 2, 2014.

[53] J. Han, H. Cheng, D. Xin, and X. Yan, “Frequent pattern mining: current status and future directions,” Data Mining and Knowledge Discovery, vol. 15, no. 1, pp. 55–86, 2007.

[54] J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation,” in ACM SIGMOD Record, vol. 29, pp. 1–12, ACM.

[55] R. Agrawal, R. Srikant, et al., “Fast algorithms for mining association rules,” in Proc. 20th Int. Conf. Very Large Data Bases, VLDB, vol. 1215, pp. 487–499, 1994.

[56] P. Fournier-Viger, J. C.-W. Lin, B. Vo, T. T. Chi, J. Zhang, and H. B. Le, “A survey of itemset mining,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 7, no. 4, 2017.

[57] D. Brauckhoff, X. Dimitropoulos, A. Wagner, and K. Salamatian, “Anomaly extraction in backbone networks using association rules,” in Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, pp. 28–34, ACM, 2009.

[58] B. Fernando, E. Fromont, and T. Tuytelaars, “Effective use of frequent itemset mining for image classification,” in European Conference on Computer Vision, pp. 214–227, Springer, 2012.

[59] E. Glatz, S. Mavromatidis, B. Ager, and X. Dimitropoulos, “Visualizing big network traffic data using frequent pattern mining and hypergraphs,” Computing, vol. 96, no. 1, pp. 27–38, 2014.

[60] J. Han, J. Pei, Y. Yin, and R. Mao, “Mining frequent patterns without candidate generation: A frequent-pattern tree approach,” Data Mining and Knowledge Discovery, vol. 8, no. 1, pp. 53–87, 2004.

[61] M. J. Zaki, “Scalable algorithms for association mining,” IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 3, pp. 372–390, 2000.

[62] Z.-H. Deng and S.-L. Lv, “Fast mining frequent itemsets using nodesets,” Expert Systems with Applications, vol. 41, no. 10, pp. 4505–4512, 2014.


[63] Y. W. T. Pramono et al., “Anomaly-based intrusion detection and prevention system on website usage using rule-growth sequential pattern analysis: Case study: Statistics of Indonesia (BPS) website,” in Advanced Informatics: Concept, Theory and Application (ICAICTA), 2014 International Conference of, pp. 203–208, IEEE, 2014.

[64] Y. J. M. Pokou, P. Fournier-Viger, and C. Moghrabi, “Authorship attribution using small sets of frequent part-of-speech skip-grams,” in FLAIRS Conference, pp. 86–91, 2016.

[65] P. Fournier-Viger, J. C.-W. Lin, R. U. Kiran, Y. S. Koh, and R. Thomas, “A survey of sequential pattern mining,” Data Science and Pattern Recognition, vol. 1, no. 1, pp. 54–77, 2017.

[66] R. Srikant and R. Agrawal, “Mining sequential patterns: Generalizations and performance improvements,” in International Conference on Extending Database Technology, pp. 1–17, Springer, 1996.

[67] M. J. Zaki, “SPADE: An efficient algorithm for mining frequent sequences,” Machine Learning, vol. 42, no. 1-2, pp. 31–60, 2001.

[68] A. Silva and C. Antunes, “Constrained pattern mining in the new era,” Knowledge and Information Systems, vol. 47, no. 3, pp. 489–516, 2016.

[69] R. Srikant, Q. Vu, and R. Agrawal, “Mining association rules with item constraints,” in KDD, vol. 97, pp. 67–73, 1997.

[70] J. Han, H. Cheng, D. Xin, and X. Yan, “Frequent pattern mining: current status and future directions,” Data Mining and Knowledge Discovery, vol. 15, no. 1, pp. 55–86, 2007.

[71] Y.-L. Chen and Y.-H. Hu, “Constraint-based sequential pattern mining: The consideration of recency and compactness,” Decision Support Systems, vol. 42, no. 2, pp. 1203–1215, 2006.

[72] J. Pei, J. Han, and L. V. Lakshmanan, “Mining frequent itemsets with convertible constraints,” in Data Engineering, 2001. Proceedings. 17th International Conference on, pp. 433–442, IEEE, 2001.


[73] Y.-L. Chen, M.-C. Chiang, and M.-T. Ko, “Discovering time-interval sequential patterns in sequence databases,” Expert Systems with Applications, vol. 25, no. 3, pp. 343–354, 2003.

[74] M. Garofalakis, R. Rastogi, and K. Shim, “Mining sequential patterns with regular expression constraints,” IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 3, pp. 530–552, 2002.

[75] C. M. Antunes and A. L. Oliveira, “Inference of sequential association rules guided by context-free grammars,” in International Colloquium on Grammatical Inference, pp. 1–13, Springer, 2002.

[76] M. Wojciechowski and M. Zakrzewicz, “Dataset filtering techniques in constraint-based frequent pattern mining,” in Pattern Detection and Discovery, pp. 77–91, Springer, 2002.

[77] M. J. Zaki, “Sequence mining in categorical domains: incorporating constraints,” in Proceedings of the ninth International Conference on Information and Knowledge Management, pp. 422–429, ACM, 2000.

[78] G. Piatetsky-Shapiro, “Discovery, analysis, and presentation of strong rules,” Knowledge Discovery in Databases, pp. 229–238, 1991.

[79] R. Agrawal, T. Imieliński, and A. Swami, “Mining association rules between sets of items in large databases,” in ACM SIGMOD Record, vol. 22, pp. 207–216, ACM, 1993.

[80] P.-N. Tan, M. Steinbach, and V. Kumar, “Association analysis: basic concepts and algorithms,” Introduction to Data Mining, pp. 327–414, 2005.

[81] P. Fournier-Viger, T. Gueniche, S. Zida, and V. S. Tseng, “ERMiner: sequential rule mining using equivalence classes,” in International Symposium on Intelligent Data Analysis, pp. 108–119, Springer, 2014.

[82] Ö. Çelebi, E. Zeydan, İ. Arı, Ö. İleri, and S. Ergüt, “Alarm sequence rule mining extended with a time confidence parameter,” in IEEE International Conference on Data Mining (ICDM), 2014.

[83] L. Ong, M. Bergés, and H. Y. Noh, “Exploring sequential and association rule mining for pattern-based energy demand characterization,” in Proceedings of the 5th ACM Workshop on Embedded Systems For Energy-Efficient Buildings, pp. 1–2, ACM, 2013.

[84] Wang Yong, Li Zhanhuai, and Zhang Yang, “Mining sequential association-rule for improving web document prediction,” in Sixth International Conference on Computational Intelligence and Multimedia Applications (ICCIMA'05), pp. 146–151, Aug 2005.

[85] M. Naedele and O. Biderbost, “Human-assisted intrusion detection for process control systems,” in Proceedings of the Second International Conference on Applied Cryptography and Network Security, pp. 216–225.

[86] C. Balducelli, L. Lavalle, and G. Vicoli, “Novelty detection and management to safeguard information-intensive critical infrastructures,” International Journal of Emergency Management, vol. 4, no. 1, pp. 88–103, 2007.

[87] R. Vaarandi, “A data clustering algorithm for mining patterns from event logs,” in Proceedings of the 2003 IEEE Workshop on IP Operations and Management (IPOM), pp. 119–126.

[88] F. Salfner, S. Tschirpke, and M. Malek, “Comprehensive logfiles for autonomic systems,” in Proceedings of the 18th International Conference on Parallel and Distributed Processing Symposium, p. 211, IEEE.

[89] B. Zhu, A. Joseph, and S. Sastry, “A taxonomy of cyber attacks on SCADA systems,” in Proceedings of the 4th International Conference on Cyber, Physical and Social Computing, pp. 380–388, IEEE.

[90] S. Cheung, B. Dutertre, M. Fong, U. Lindqvist, K. Skinner, and A. Valdes, “Using model-based intrusion detection for SCADA networks,” in Proc. of the SCADA Security Scientific Symp., vol. 46, pp. 1–12, 2007.

[91] R. R. R. Barbosa, R. Sadre, and A. Pras, “A first look into SCADA network traffic,” in Network Operations and Management Symposium (NOMS), 2012 IEEE, pp. 518–521, IEEE.

[92] A. Valdes and S. Cheung, “Communication pattern anomaly detection in process control systems,” in Technologies for Homeland Security, 2009. HST'09. IEEE Conference on, pp. 22–29, IEEE.


[93] R. A. Kemmerer and G. Vigna, “Intrusion detection: A brief history and overview (supplement to Computer magazine),” Computer, no. 4, pp. 27–30, 2002.

[94] E. Bloedorn, A. D. Christiansen, W. Hill, C. Skorupka, L. M. Talbot, and J. Tivel, “Data mining for network intrusion detection: How to get started,” report, MITRE Technical Report, 2001.

[95] D. Barbara, N. Wu, and S. Jajodia, “Detecting Novel Network Intrusions Using Bayes Estimators,” in 1st SIAM Conf. on Data Mining, pp. 1–17, 2001.

[96] K. Sujatha and K. R. S. Rao, “A survey on infrequent pattern mining,” International Journal of Advances in Engineering & Technology, vol. 6, no. 4, p. 1728, 2013.

[97] B. Saha, M. Lazarescu, and S. Venkatesh, “Infrequent item mining in multiple data streams,” in Data Mining Workshops, 2007. ICDM Workshops 2007. Seventh IEEE International Conference on, pp. 569–574, IEEE.

[98] L. Szathmary, A. Napoli, and P. Valtchev, “Towards rare itemset mining,” in 19th IEEE Int. Conf. on Tools with Artificial Intell. (ICTAI 2007), vol. 1, pp. 305–312, 2007.

[99] R. Vaarandi, Tools and Techniques for Event Log Analysis. Tallinn University of Technology Press, 2005.

[100] M.-S. Chen, J. Han, and P. S. Yu, “Data mining: an overview from a database perspective,” IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, pp. 866–883, 1996.

[101] C. S. Hemalatha, V. Vaidehi, and R. Lakshmi, “Minimal infrequent pattern based approach for mining outliers in data streams,” Expert Systems with Applications, vol. 42, no. 4, pp. 1998–2012, 2015.

[102] A. Rahman, Y. Xu, K. Radke, and E. Foo, “Finding Anomalies in SCADA Logs Using Rare Sequential Pattern Mining,” in International Conference on Network and System Security, pp. 499–506, Springer, 2016.


[103] P. Fournier-Viger, A. Gomariz, M. �ebek, and M. Hlosta, VGEN: Fast

Vertical Mining of Sequential Generator Patterns, pp. 476�488. Springer,

2014.

[104] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, �E�cient mining of as-

sociation rules using closed itemset lattices,� Information systems, vol. 24,

no. 1, pp. 25�46, 1999.

[105] H. Mannila and H. Toivonen, �Levelwise search and borders of theories in

knowledge discovery,� Data mining and knowledge discovery, vol. 1, no. 3,

pp. 241�258, 1997.

[106] S. Yi, T. Zhao, Y. Zhang, S. Ma, and Z. Che, �An e�ective algorithm

for mining sequential generators,� Procedia Engineering, vol. 15, pp. 3653�

3657, 2011.

[107] C. Gao, J. Wang, Y. He, and L. Zhou, �E�cient mining of frequent sequence

generators,� in Proceedings of the 17th international conference on World

Wide Web, pp. 1051�1052, ACM, 2008.

[108] P. Fournier-Viger, A. Gomariz, T. Gueniche, A. Soltani, C. Wu., and V. S.

Tseng, �SPMF: a Java Open-Source Pattern Mining Library,� Journal of

Machine Learning Research (JMLR), vol. 15, pp. 3389�3393, 2014.

[109] D. Hawkins, Identi�cation of Outliers. Chapman and Hall, London, 1980.

[110] R. Kaur and S. Singh, �A survey of data mining and social network anal-

ysis based anomaly detection techniques,� Egyptian Informatics Journal,

vol. 17, no. 2, pp. 199�216, 2016.

[111] W. Lee, S. J. Stolfo, et al., �Data mining approaches for intrusion detec-

tion.,� in Usenix security, 1998.

[112] S. Pan, T. Morris, and U. Adhikari, “Developing a hybrid intrusion detection system using data mining for power systems,” IEEE Transactions on Smart Grid, vol. 6, no. 6, pp. 3104–3113, 2015.

[113] S. Bistarelli and F. Bonchi, “Soft constraint based pattern mining,” Data & Knowledge Engineering, vol. 62, no. 1, pp. 118–137, 2007.

[114] R. T. Ng, L. V. Lakshmanan, J. Han, and A. Pang, “Exploratory mining and pruning optimizations of constrained associations rules,” in ACM SIGMOD Record, vol. 27, pp. 13–24, ACM, 1998.

[115] J.-F. Boulicaut and B. Jeudy, “Constraint-based data mining,” in Data Mining and Knowledge Discovery Handbook, pp. 339–354, Springer, 2009.

[116] J. Han, L. V. Lakshmanan, and R. T. Ng, “Constraint-based, multidimensional data mining,” Computer, vol. 32, no. 8, pp. 46–50, 1999.

[117] R. J. Bayardo, R. Agrawal, and D. Gunopulos, “Constraint-based rule mining in large, dense databases,” Data Mining and Knowledge Discovery, vol. 4, no. 2-3, pp. 217–240, 2000.

[118] V. Grossi, A. Romei, and F. Turini, “Survey on using constraints in data mining,” Data Mining and Knowledge Discovery, vol. 31, no. 2, pp. 424–464, 2017.

[119] Y.-L. Chen and Y.-H. Hu, “The consideration of recency and compactness in sequential pattern mining,” in Proceedings of the second workshop on Knowledge Economy and Electronic Commerce, vol. 42, pp. 1203–1215, 2006.

[120] M. N. Garofalakis, R. Rastogi, and K. Shim, “Spirit: Sequential pattern mining with regular expression constraints,” in VLDB, vol. 99, pp. 7–10, 1999.

[121] S. Parthasarathy, M. J. Zaki, M. Ogihara, and S. Dwarkadas, “Incremental and interactive sequence mining,” in Proceedings of the eighth international conference on Information and knowledge management, pp. 251–258, ACM, 1999.

[122] N. A. K. Desai and A. Ganatra, “Efficient constraint-based sequential pattern mining (spm) algorithm to understand customers' buying behaviour from time stamp-based sequence dataset,” Cogent Engineering, vol. 2, no. 1, p. 1072292, 2015.

[123] C. Antunes and A. L. Oliveira, “Generalization of pattern-growth methods for sequential pattern mining with gap constraints,” in International Workshop on Machine Learning and Data Mining in Pattern Recognition, pp. 239–251, Springer, 2003.

[124] J. Han, J. Pei, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu, “Prefixspan: Mining sequential patterns efficiently by prefix-projected pattern growth,” in Proceedings of the 17th international conference on data engineering, pp. 215–224, 2001.

[125] T. Morzy, M. Wojciechowski, and M. Zakrzewicz, “Efficient constraint-based sequential pattern mining using dataset filtering techniques,” in Databases and Information Systems II, pp. 297–309, Springer, 2002.

[126] J. Pei, J. Han, and W. Wang, “Mining sequential patterns with constraints in large databases,” in Proceedings of the eleventh international conference on Information and knowledge management, pp. 18–25, ACM, 2002.

[127] J. Zhu, H. Wu, and G. Gao, “An efficient method of web sequential pattern mining based on session filter and transaction identification,” Journal of Networks, vol. 5, no. 9, p. 1017, 2010.

[128] D. Pyle, Data Preparation for Data Mining, vol. 1. Morgan Kaufmann, 1999.

[129] R. Ranjan and G. Sahoo, “A new clustering approach for anomaly intrusion detection,” arXiv preprint arXiv:1404.2772, 2014.

[130] “Verizon, 2013 data breach investigations report.” http://www.verizonenterprise.com/resources/reports/rp_data-breach-investigations-report-2013_en_xg.pdf, 2013 (accessed March 10, 2018).

[131] P. Fournier-Viger, R. Nkambou, and V. S.-M. Tseng, “Rulegrowth: mining sequential rules common to several sequences by pattern-growth,” in Proceedings of the 2011 ACM symposium on applied computing, pp. 956–961, ACM, 2011.

[132] P. Fournier-Viger, “An Introduction to Sequential Rule Mining - The Data Mining Blog.” http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/, 2015 (accessed February 7, 2018).

[133] S. K. Harms, J. Deogun, and T. Tadesse, “Discovering sequential association rules with constraints and time lags in multiple sequences,” in International Symposium on Methodologies for Intelligent Systems, pp. 432–441, Springer, 2002.

[134] D. Lo, S.-C. Khoo, and L. Wong, “Non-redundant sequential rules - theory and algorithm,” Information Systems, vol. 34, no. 4-5, pp. 438–453, 2009.

[135] P. Fournier-Viger, U. Faghihi, R. Nkambou, and E. M. Nguifo, “Cmrules: Mining sequential rules common to several sequences,” Knowledge-Based Systems, vol. 25, no. 1, pp. 63–76, 2012.

[136] G. Das, K.-I. Lin, H. Mannila, G. Renganathan, and P. Smyth, “Rule discovery from time series,” in KDD, vol. 98, pp. 16–22, 1998.

[137] J. Deogun and L. Jiang, “Prediction mining - an approach to mining association rules for prediction,” in International Workshop on Rough Sets, Fuzzy Sets, Data Mining, and Granular-Soft Computing, pp. 98–108, Springer, 2005.

[138] P. Fournier-Viger, T. Gueniche, and V. S. Tseng, “Using partially-ordered sequential rules to generate more accurate sequence prediction,” in International Conference on Advanced Data Mining and Applications, pp. 431–442, Springer, 2012.
