
Project Deliverable

D6.1 Preliminary Architecture and Service

Model of Infrastructure Enhancements

Project Number: 700692
Project Title: DiSIEM – Diversity-enhancements for SIEMs
Programme: H2020-DS-04-2015
Deliverable type: Report
Dissemination level: PU
Submission date: August 31st, 2017
Resubmission date: May 31st, 2018
Responsible partner: Amadeus
Editor: Miruna Mironescu
Revision: 2.0

The DiSIEM project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 700692.


Editor: Miruna Mironescu, Amadeus

Contributors: Miruna Mironescu, Amadeus; Zayani Dabbabi, Amadeus; Frances Buontempo, City; Susana González Zarzosa, Atos; Gustavo Gonzalez, Atos; Adriano Serckumecka, FCiências.ID; Ibéria Medeiros, FCiências.ID; Alysson Bessani, FCiências.ID

Version History:
1.0 – Amadeus, City, Atos, FCiências.ID – First version of the document, submitted to the EC
1.1 – Amadeus, FCiências.ID – Corrections of the minor problems requested by the reviewers after the first period
2.0 – FCiências.ID – Minor corrections by the project coordinator, submitted to the EC


Executive Summary

This report provides a preliminary architecture and service model of SIEM infrastructure enhancements. Four extensions/components are discussed in this report: Enhanced Application Monitoring, Network-based Behavior Anomaly Detection, Diverse Monitoring of Critical Assets and Cloud-backed Long-term Event Archive. For each extension, we provide a comprehensive state-of-the-art study, a high-level design definition and a detailed description of the underlying models.

The Enhanced Application Monitoring and Network-based Behavior Anomaly Detection extensions aim to leverage both rule-based and behavioural anomaly detection models for early detection of deviations from normal user behaviour. A combination of detection models is also used to improve the robustness and performance of the anomaly detection. The extensions rely on both application logs and network traces for the learning and detection phases, and feed the detection results to the SIEM.

The Diverse Monitoring of Critical Assets module, which is part of the Diversity Assessment and Forecasting component in the DiSIEM architecture (Deliverable 2.2), is used to filter, aggregate and adjudicate alerts coming from diverse monitoring systems. This extension makes it possible to assess the uncertainty of the raised alerts and to estimate the performance gained by combining monitoring system results. The extension also has a forecasting and prediction feature: based on previous alerts and labelled data, it uses known reliability growth models to generate predictions of the time to the next false positive and to the next false negative alert.

Finally, the Cloud-backed Long-term Event Archive component aims to address the storage limitations of current SIEM systems. This extension provides a long-term cloud-based archival solution for SIEM events, offering a secure and cost-effective event archival system that leverages public cloud storage services. The extension can support a variety of compliance processes and facilitate forensic investigations spanning long periods of time.


Table of Contents

1 Introduction
2 Enhanced Application Monitoring
2.1 Motivation
2.2 State of the art
2.2.1 Anomaly detection
2.2.2 Anomaly Detection Techniques
2.2.3 Network Anomaly Detection
2.2.4 Application-Specific Anomaly Detection
2.3 Anomaly Detection Models
2.3.1 Application Anomaly Detection Framework
2.3.2 Network-based Anomaly Detection
2.4 Architecture
2.4.1 Application-based anomaly detection
2.4.2 Network-based Behavior Anomaly Detection
2.5 Testing and Validation
2.5.1 Application-based Anomaly Detection
2.5.2 Network-based Anomaly Detection
3 Diverse monitoring of critical assets
3.1 Motivation
3.2 State of the Art
3.3 Models
3.4 Architecture and Implementation
3.5 Testing and Validation
4 Cloud-backed Long-term Event Archive
4.1 Motivation
4.2 State of the Art
4.3 Secure Data Storage Using Multiple Clouds
4.4 SLiCER
4.4.1 SLiCER Overview
4.4.2 SLiCER Architecture
4.4.3 Data Model
4.4.4 Query Model
4.4.5 Events Search Algorithm
4.5 Indexing
4.5.1 No Index
4.5.2 Bloom Filters
4.5.3 Lucene Index
4.6 Preliminary Cost Analysis
5 Summary and Conclusions
6 References


List of Figures

Figure 1: A Membership Function example
Figure 2: Skeptic Architecture
Figure 3: Novelty Detection Method [63]
Figure 4: Outlier Detection Method [64]
Figure 5: Enhanced Application Monitoring extension architecture
Figure 6: Network-based Anomaly Detector Preliminary Architecture
Figure 7: Architecture of Diversity and Forecasting tool: Implementation and Architecture
Figure 8: SIEM archival architectures
Figure 9: SLiCER dependable cloud storage, adapted from DepSky
Figure 10: The SLiCER system architecture
Figure 11: SLiCER data model
Figure 12: Bloom Filter execution scenario
Figure 13: Indexing example performed by Lucene [102]


List of Tables

Table 1: Reference prices charged by Amazon S3 to store and read data [85]
Table 2: Amount of data generated by partners of the DiSIEM project
Table 3: Estimated costs for storing data in the cloud after 5 years


1 Introduction

Security Information and Event Management (SIEM) products have a reputation as helpful, must-have solutions for improving security in any enterprise. However, SIEMs have certain limitations that make them inefficient without additional investments. As discussed in Deliverable 2.1, some of the limitations of existing SIEMs are the lack of advanced user behaviour analytics capabilities for application monitoring, the absence of methods for assessing, predicting and forecasting the performance of diverse monitoring systems, and the fact that most SIEMs lack cost-effective and secure long-term event archival capabilities. The goal of Work Package 6 (“Infrastructure Enhancements”) is to address these limitations by developing different types of extensions that will enhance the SIEM capabilities.

This report presents a detailed description of the design and underlying models used for the infrastructure enhancement extensions. As the first deliverable in Work Package 6, this report contains partial results for Tasks 6.1, 6.2 and 6.3.

The rest of the report is structured into three main chapters, plus a conclusion. In Chapter 2, we present the enhanced application monitoring extensions. Two contributions are proposed here: an application-specific anomaly detection component (Application-based Anomaly Detection) and a network anomaly detection sensor (Network-based Behaviour Anomaly Detection). For both components, we provide a preliminary design. The chapter also includes a study of the state of the art for both approaches, and defines the analytical models used for anomaly detection and the motivation behind their choice. In Chapter 3, we provide a description of the Diverse Monitoring of Critical Assets module, a key part of the Diverse Monitoring Assessment and Forecasting component. Similarly to the previous chapter, this chapter contains a description of the architecture of the module, related work and the extension model to implement. In Chapter 4, we present a description of the Cloud-backed Long-term Event Archive component, which we dubbed SLiCER (Safe Long-term Cloud Event aRchival). This chapter includes a detailed description of the motivation behind the development of the extension, an in-depth state-of-the-art study and a detailed description of the SLiCER architecture, data and query models.


2 Enhanced Application Monitoring

2.1 Motivation

There has been a great shift in the security analytics and threat hunting landscape in recent years, and SIEM providers have been facing tough challenges to safeguard against ever-advancing threats. SIEM provides adequate solutions for the Security Operations Center (SOC) and IT teams in terms of real-time monitoring based upon correlation rules and alerts. Moreover, it provides the long-term storage, search and reporting mechanisms for amplified visibility into network log data. Although SIEM has held its place in the security market for many years, it is showing its limitations when it comes to modern emerging threats. Below are some of the known SIEM limitations nowadays:

• It lacks the ability to identify the unknown and hidden threats already present in the network.

• It does not provide an entity-centric view related to insider threats. SIEM correlation rules can be used to detect a threat via data coming into the network, but it does not have the ability to analyse the behaviours of the users and hosts inside the network.

• SIEM’s processing power is limited to a specific data set and schemas, which does not give you a full picture of the activity inside an organization. In fact, SIEM solutions are historically associated with IT infrastructure, not user and entity behaviour monitoring.

• Most SIEM solutions lack schema support for business application log data and lack out-of-the-box content (rules, alerts, dashboards, etc.) for custom application security monitoring.

User and entity behaviour analytics (UEBA) can be used as an extension of SIEM to address these limitations. UEBA is an analytics-led threat detection technology. It uses machine learning and data science to gain an understanding of how users (humans) within an environment typically behave, and then to find risky, anomalous activity that deviates from their normal behaviour and may be indicative of a threat. UEBA offers profiling and anomaly detection based on a range of analytics approaches, usually using a combination of basic analytics methods (e.g., rules that leverage signatures, pattern matching and simple statistics) and advanced analytics (e.g., supervised and unsupervised machine learning). According to a Gartner report from August 2016 [1], by the end of 2017 at least 60% of major SIEM vendors are expected to incorporate advanced analytics and UEBA functionality into their products, either through acquisitions, partnerships or natively. Currently, very few major SIEM providers offer advanced UEBA capabilities.


To enhance the SIEM infrastructure, we propose to develop two UEBA-based Enhanced Application Monitoring extensions: an application-based anomaly detection module and a network flow anomaly detector. While the former will use application audit logs to monitor user activities, the latter will perform deep packet inspection for network behaviour monitoring, in order to give visibility into user and application activity not captured in logs. Both extensions will send detection results to the SIEM in use. Both the design and development of these components need to be done in a generic way to facilitate their reuse in different use cases.

2.2 State of the art

2.2.1 Anomaly detection

Anomaly detection is an open research area that considers the problem of finding patterns in data that do not conform to expected behaviour. Such patterns are referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities, or contaminants in different application domains (e.g., fraud detection for credit cards, insurance, health care, intrusion detection for cyber-security, fault detection in safety-critical systems, military activities, etc.) [2]. Many techniques in the literature have been used to detect anomalies for different use cases and applications. Some techniques use a scoring system to rank anomalies based on the degree to which the evaluated instance deviates from the expectation. Other techniques use a labelling system to denote whether the instance is normal or anomalous; the latter setting is often dynamic in nature (e.g., new types of anomalies can be identified), making it difficult to associate the training data with a particular label. Based on the extent to which labels are available, anomaly detection techniques operate in one of three modes [2]: (i) supervised, (ii) semi-supervised and (iii) unsupervised anomaly detection.

Supervised Anomaly Detection: in a supervised mode, the training data set has labelled instances for normal and anomaly classes. A predictive model is generally built for normal vs. anomaly classes, and any unseen data instance is compared against the model to determine which class it belongs to. This technique has two major drawbacks: (a) anomalous instances are far fewer than normal instances in the training data; (b) labelling all classes of anomalous behaviour is a very difficult task.

Semi-supervised Anomaly Detection: in this case, the training data set has labelled instances only for the normal class. Any data instance not falling in that class is considered an anomaly. Since this technique does not require labels for anomaly classes, it is more widely applicable than supervised techniques, specifically in critical scenarios (e.g., spacecraft fault detection) where anomalous behaviour is difficult to model.


Unsupervised Anomaly Detection: this mode does not require training data and is thus the most widely applicable, but it has the caveat that the notion of anomaly is not well defined and is therefore prone to errors and uncertainties. Techniques in this category assume that normal instances are far more frequent than anomalies in the test data; if this is not true, such techniques suffer from a high false alarm rate. Many semi-supervised techniques can be adapted to operate in an unsupervised mode by using a sample of the unlabelled data set as training data. Such an adaptation assumes that the test data contains very few anomalies and that the model learned during training is robust to these few anomalies.

In the following sections, we provide a classification of popular anomaly detection techniques that have been used in general, and then focus on the techniques that are most suitable for network-based and application-based anomaly detection.

2.2.2 Anomaly Detection Techniques

Many anomaly detection techniques have been proposed in the literature. Some are designed for and applied to certain application domains, while others are more generic. Several surveys [3] [2] [4] [5] review the existing anomaly detection techniques in order to provide a comprehensive overview and a taxonomy of the techniques used to solve the anomaly detection problem. In this section, we present an overview of research directions applying supervised and unsupervised methods to the problem of anomaly detection. Based on the information used and the techniques employed, we propose a classification of anomaly detection techniques into five major groups: (i) Statistical Methods, (ii) Knowledge Based Methods, (iii) Distance based Methods, (iv) Model based approaches and (v) Graph based Methods. However, anomaly detection algorithms are quite diverse in nature, and one technique may fit into more than one category.

2.2.2.1 Statistical Methods

Statistical anomaly detection techniques work by fitting a statistical model to the data at hand and then applying a statistical test to an unseen data instance to check whether it belongs to the model. The underlying principle of statistical anomaly detection can be summarized as [6]: "An anomaly is an observation which is suspected of being partially or wholly irrelevant because it is not generated by the stochastic model assumed". Statistical anomaly detection techniques assume that normal data instances occur in the high-probability region of a statistical model, while anomalies occur in its low-probability region.


Statistical techniques fall into two categories, parametric and non-parametric: while parametric techniques assume knowledge of the underlying data distribution and estimate its parameters from the data, non-parametric techniques make no assumptions about the data distribution. Non-parametric approaches might still include parameters, but these are not used to define the shape of the distribution; rather, they define the complexity of the model resulting from the distribution. Parametric techniques assume that the normal data is generated by a parametric distribution with parameters Θ and probability density function f(x, Θ), where x is an observation. The anomaly score of a test instance x is the inverse of the probability density function f(x, Θ). The parameters Θ are estimated from the given data. Based on the type of distribution assumed, parametric techniques can be further categorized as follows: Gaussian Model Based, Regression Model Based, and Mixture of Parametric Distributions Based.

Gaussian Model Based parametric outlier detection techniques assume that the data is generated from a Gaussian distribution. The parameters are estimated using maximum likelihood estimates (MLE). The distance between a data instance and the estimated mean is the anomaly score, and a threshold is usually applied to the anomaly scores to determine the anomalies. Regression Model Based parametric outlier detection techniques are often composed of two steps: the first is to estimate a regression model fitted to the data, and the second is to compute the residual of each test instance against the model; the magnitude of the residual can be used as the anomaly score for the test instance. Mixture of Parametric Distributions Based anomaly detection techniques use a mixture of parametric statistical distributions to model the data.
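The Gaussian case is simple enough to sketch end-to-end. The following Python snippet is a minimal, illustrative example (the feature, data and 3-sigma threshold are invented and not taken from the deliverable): it estimates the mean and standard deviation by MLE and flags test values that lie far from the mean.

import numpy as np

def fit_gaussian(train):
    # MLE of mean and standard deviation for a univariate Gaussian.
    return train.mean(), train.std()

def anomaly_scores(test, mu, sigma):
    # Anomaly score: distance from the estimated mean, in standard deviations.
    return np.abs(test - mu) / sigma

# Illustrative training data: session durations (seconds) of normal users.
train = np.random.normal(loc=120, scale=15, size=10_000)
mu, sigma = fit_gaussian(train)

test = np.array([118.0, 127.0, 300.0])
scores = anomaly_scores(test, mu, sigma)
print(scores > 3.0)   # flag values more than 3 sigma away -> [False False  True]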

Non-parametric Techniques use non-parametric statistical models, such that the model structure is not defined a priori, but is instead determined from given data. Such techniques typically make fewer assumptions regarding the data, such as smoothness of density, when compared to parametric techniques. Non-Parametric techniques can be categorized as follows: Histogram Based, Kernel Function Based.

The Histogram Based techniques consist of using histograms to maintain a profile of the normal data. A basic histogram based anomaly detection technique for univariate data consists of two steps. The first step involves building a histogram based on the different values taken by that feature in the training data. In the second step, the technique checks if a test instance falls in any one of the bins of the histogram. If it does, the test instance is normal, otherwise it is anomalous.
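A minimal sketch of this two-step procedure for a single (univariate) feature might look as follows; the data, bin count and feature are illustrative only.

import numpy as np

def build_histogram(train, bins=20):
    # Step 1: histogram of the feature values observed during training.
    counts, edges = np.histogram(train, bins=bins)
    return counts, edges

def is_anomalous(value, counts, edges):
    # Step 2: a test value is anomalous if it falls outside every non-empty bin.
    idx = np.searchsorted(edges, value, side="right") - 1
    if idx < 0 or idx >= len(counts):
        return True               # outside the range seen during training
    return counts[idx] == 0       # lands in an empty bin

train = np.concatenate([np.random.normal(50, 5, 5000),
                        np.random.normal(80, 3, 2000)])
counts, edges = build_histogram(train)
print(is_anomalous(52.0, counts, edges))    # False: typical value
print(is_anomalous(200.0, counts, edges))   # True: never observed in training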


Kernel Function Based techniques rely on non-parametric probability density estimation, such as Parzen window estimation [7], which uses kernel functions to approximate the actual density.

The advantages and disadvantages of using statistical anomaly detection techniques are as follows:

• The advantages of statistical anomaly detection techniques are: they provide a statistically sound approach to anomaly detection if the assumptions on the data distribution are correct; the anomaly score is associated with a confidence interval, which is very useful when deciding about a data instance; moreover, statistical techniques can be used in an unsupervised setting without any need for labelled data.

• The main disadvantage of statistical techniques is that they rely on the assumption that the data is generated from a particular distribution, which is especially problematic for high-dimensional datasets. In addition, several hypothesis test statistics can be used to detect anomalies, and choosing the best statistic is often not a straightforward task [8].

2.2.2.2 Knowledge Based Methods

Knowledge-based anomaly detection methods are among the earliest techniques used for anomaly and misuse detection. These techniques search for instances of known attacks by attempting to match them against pre-determined attack representations. Knowledge-based anomaly detection techniques can be further divided into two approaches: rule-based and expert-system approaches, and ontology and logic-based approaches [9].

In rule-based and expert-system approaches, expert systems encode intrusive scenarios as a set of rules, which are matched against audit or network traffic data. Any deviation in the rule matching process is reported as an intrusion. Most intrusion detection systems nowadays are signature based. The advantages of rule-based anomaly detection are its efficiency, its low false positive rate and the capability of embedding expert knowledge easily. The drawbacks are the need to receive regular signature updates, much like an antivirus, and the fact that rule-based anomaly detection techniques cannot detect novel attacks that are not already stored in the rule database. Combining machine-learning techniques with rule-based detection has been proposed in the literature, in the form of rule extraction algorithms that aim to create rules reflecting approximate classification results. However, many rule extraction algorithms depend on neural networks, and the rule extraction itself is usually a supervised learning task that needs previously created classification and labelling information. A rule extraction framework based on unsupervised learning has been proposed by Antti Juvanen and Tuomo Sipola [10].
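To make the rule-matching idea concrete, the short Python sketch below matches invented signature rules against a single audit event; the rule format, event fields and thresholds are hypothetical and not taken from the deliverable.

# Hypothetical signature rules matched against audit-log events.
RULES = [
    {"name": "brute_force", "field": "failed_logins", "op": ">", "value": 10},
    {"name": "priv_esc", "field": "command", "op": "==", "value": "sudo su -"},
]

OPS = {">": lambda a, b: a > b, "==": lambda a, b: a == b}

def match_rules(event, rules=RULES):
    # Return the names of all rules triggered by one audit event.
    hits = []
    for rule in rules:
        value = event.get(rule["field"])
        if value is not None and OPS[rule["op"]](value, rule["value"]):
            hits.append(rule["name"])
    return hits

event = {"user": "alice", "failed_logins": 14, "command": "ls"}
print(match_rules(event))   # ['brute_force']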


Ontology and logic-based approaches rely on modelling attack signatures with expressive logic structures in real time, incorporating constraints and statistical properties. Instead of simple rules and matching patterns, in logic-based approaches patterns are specified as formulae in an expressively rich and efficiently monitorable logic. Different techniques have been proposed, such as EAGLE logic [11] and the finite state machine (FSM) methodology [12]. The authors of [13] used ontologies as a way of describing the knowledge of a domain, expressing the intrusion detection system in terms of the end-user domain. Ontologies are used as a conceptual modelling tool, allowing a non-expert to model intrusion detection applications more intuitively using the concepts of intrusion detection.

2.2.2.3 Distance based Methods

These approaches attempt to overcome the limitations of statistical outlier detection. Distance-based methods detect outliers by calculating different distances between points. More explicitly, they compute the full-dimensional distances of points from one another using various features, enhanced by the densities of local neighbourhoods [2]. Nearest-neighbour-based anomaly detection techniques rely on the following assumption: normal data instances are located in dense neighbourhoods, while anomalies are located "far" from their closest neighbours. Hence, this family of techniques requires a distance or similarity measure defined between two points, which can of course be computed in different ways. For continuous attributes, the Euclidean distance is a frequent choice. For categorical attributes, the simple matching coefficient is often used. For multivariate data instances, the similarity or distance is usually computed for each attribute and then combined. It is worth mentioning that not all distances are required to be strictly metric: they are normally required to be positive-definite and symmetric, but they do not have to satisfy the triangle inequality. Nearest-neighbour techniques can be broadly grouped into two categories:

• Techniques that use the distance of a data point to its kth nearest neighbor as an anomaly score

• Techniques that use the relative density of each data instance to compute an anomaly score

A benefit of this family of techniques is that it is unsupervised by default and makes no assumptions regarding the distribution of the data: it is purely data driven. However, semi-supervised variants perform better than unsupervised ones in terms of anomaly detection, since the likelihood of an anomaly forming a close neighbourhood in the training data set is quite low. Also, adapting this method to a different data type is rather straightforward, and centres on finding an appropriate distance measure.


The downside, however, is that since the method is unsupervised by default, normal data instances that do not have close neighbours will be considered outliers. Also, for the semi-supervised variants, if some of the normal instances in the test set do not have similar (i.e., normal) instances in the training set, a high false positive rate is expected. One of the biggest problems is the computational complexity of the testing phase, i.e., computing the distance between all test instances and training instances. It is also worth mentioning that the performance of nearest-neighbour techniques depends greatly on the chosen distance.
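A minimal sketch of the first category above, scoring each test point by its Euclidean distance to its k-th nearest training instance (the value of k, the threshold and the data are illustrative):

import numpy as np

def knn_scores(train, test, k=5):
    # Anomaly score: Euclidean distance to the k-th nearest training instance.
    scores = []
    for x in test:
        dists = np.linalg.norm(train - x, axis=1)
        scores.append(np.sort(dists)[k - 1])
    return np.array(scores)

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(1000, 2))    # dense "normal" neighbourhood
test = np.array([[0.1, -0.2],                   # close to the normal data
                 [6.0, 6.0]])                   # far from every training point
print(knn_scores(train, test, k=5) > 1.0)       # [False  True]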

2.2.2.4 Model based approaches

Model-based anomaly detection techniques build learning models of the data using machine learning techniques. In recent years, various learning techniques have been used in the literature. This category of anomaly detection techniques can be further divided into supervised and unsupervised learning techniques. Supervised methods provide, in theory, a better detection rate than semi-supervised and unsupervised methods, since they have access to more information; however, there are some inherent limitations to this approach: the usual lack of reliable training data and the fact that training sets usually contain noise, which induces higher false positive rates. The most common supervised models used in the literature are supervised Neural Networks, Support Vector Machines, Bayesian Networks and Decision Trees.

Bayesian Networks: Heckerman [14] defined a Bayesian network as "a model that encodes probabilistic relationships among variables of interest". This technique is generally used for intrusion detection in combination with statistical schemes. It has several advantages, including the capability of encoding interdependencies between variables and of predicting events, as well as the ability to incorporate both prior knowledge and data. Bayesian networks have been used for anomaly detection in the multi-class setting. A basic technique for a univariate categorical data set using a naïve Bayesian network estimates the posterior probability of observing a class label from a set of normal class labels as well as the anomaly class label (given a test data instance). The class label with the largest posterior is chosen as the predicted class for the given test instance. The likelihood of observing the test instance given a class, and the prior on the class probabilities, are estimated from the training data set [2]. Johansen and Lee [15] and Moore and Zuev [16] used supervised Naive Bayes classifiers to improve IDS detection performance.

Neural Networks (NN) learn to predict the behaviour of different users and entities in systems.


If properly designed and implemented, NNs have the capability to address many problems encountered by rule-based approaches. Their main advantages are their tolerance to imprecise data and uncertain information, and their ability to infer solutions from data without prior knowledge of the regularities in the data; NNs are able to generalize from the learning data. Multilayer perceptrons (MLP) and radial basis function (RBF) networks are the most commonly used supervised neural networks. An MLP can only classify linearly separable instance sets: if a straight line or plane can be drawn to separate the input instances into their correct categories, the input instances are linearly separable and the perceptron will find the solution; if the instances are not linearly separable, learning will never reach a point where all instances are classified properly.

Decision Trees (DT) are a powerful tool for classification and prediction. A DT builds classification models in the form of tree structures: the training dataset is broken into smaller subsets while an associated decision tree is incrementally developed. The core algorithm for building decision trees is ID3, by J. R. Quinlan [17], which employs a top-down, greedy search through the space of possible branches with no backtracking; ID3 uses entropy and information gain to construct a decision tree.

Support Vector Machines (SVM) were motivated by the problem of supervised binary classification. By creating a feature space as a finite-dimensional vector space, an SVM trains a model by creating a linear partition of the feature space into two categories or classes. The partition is called a hyperplane, and it is used to predict the class of unseen data instances by locating them above or below the separation plane. SVM is not limited to linearly separable classification problems; by using the kernel transformation technique, SVM becomes more flexible and can introduce various types of non-linear decision boundaries. SVM was proposed by Vapnik [18] and has been used extensively in the research community for several reasons: it is suitable for high-dimensional classification problems, it is memory efficient since only a small set of training data is used in the decision process, and it offers great flexibility through the kernel transformation technique. The disadvantages of SVM are mainly that the classification result is non-probabilistic and that SVM performs poorly when the dimension of the feature space is higher than the cardinality of the training dataset.

Unsupervised models are used to tackle the shortage of labelled datasets and to find structure in a collection of unlabelled data. Unsupervised anomaly detection techniques rely on two assumptions: first, they presume that most of the data instances are normal and only a very small percentage of the data is abnormal; second, they expect anomalous data instances to be statistically distinct from normal data.


The most common unsupervised algorithms are K-means, Self-Organizing Maps (SOM), fuzzy C-means, Adaptive Resonance Theory (ART), Unsupervised Niche Clustering (UNC) and One-Class Support Vector Machines.

K-means is a traditional clustering algorithm. It is used to divide the data into K clusters, guaranteeing that data within the same cluster are similar, while data in different clusters have low similarity with each other. The K-means algorithm first selects K data items at random as the initial cluster centres. Each remaining data point is added to the cluster with the highest similarity according to its distance to the cluster centre. Then, the centre of each cluster is recalculated. The process is repeated until the cluster centres no longer change, at which point the data is divided into K clusters. After building the K clusters, the next step is to identify the cluster boundaries; for that, a percentile distance value is used. Assuming that anomalous data can also end up inside clusters, such data will lie further away from the cluster centre than normal data because of its deviation from normal behaviour. After determining the boundary of each cluster, prediction can be performed: if the distance between a data point and its cluster's centre is greater than the cluster boundary, the data point can be considered anomalous. Li [19] proposed K-means algorithms for anomaly detection. Cuixiao et al. [20] proposed a mixed intrusion detection system (IDS) model; the algorithm used is an improved version of the K-means clustering algorithm and has been demonstrated to have a high detection rate. K-means works well when clusters have similar densities and the joint distribution of features within each cluster is spherical. A further advantage of K-means is its simplicity. Its main drawbacks are the need to choose K, its sensitivity to noise and outlier data points, and its sensitivity to the initial centroid assignment.

Fuzzy C-means is a clustering method that allows one piece of data to belong to two or more clusters. It was developed by Dunn [21] and improved later by Bezdek [22]. It is used in applications for which a hard classification of data is not meaningful or is difficult to achieve. The C-means algorithm is similar to K-means, except that the membership of each point is defined by a fuzzy function, and all points contribute to the relocation of a cluster centroid based on their fuzzy membership to that cluster. Shingo et al. [23] proposed a new approach called FC-ANN, based on ANN and fuzzy clustering, to help IDSs achieve a higher detection rate, a lower false positive rate and stronger stability. Yu and Jian [24] proposed an approach integrating several soft computing techniques to build a hierarchical neuro-fuzzy inference intrusion detection system.

The two most commonly used unsupervised neural networks are self-organizing maps and adaptive resonance theory.


Both approaches are adequate for intrusion detection tasks where normal behaviour is densely concentrated around one or two centres, while anomalous behaviour and intrusions are spread in the space outside the normal clusters. The Self-Organizing Map (SOM) is trained by an unsupervised competitive learning algorithm proposed by Kohonen [25]. The aim of the SOM is to reduce the dimensionality of the data for visualization: SOM outputs are clustered in a low-dimensional (usually 2D or 3D) grid. A SOM usually consists of an input layer and the Kohonen layer, which is designed as a two-dimensional arrangement of neurons that maps n-dimensional input to two dimensions. Adaptive Resonance Theory (ART) was proposed by Gail A. Carpenter and Stephen Grossberg [26]. ART is based on competition and uses an unsupervised learning model; it is open to new learning (adaptive) without losing the old patterns (resonance). Unsupervised ART models include ART-1, ART-2, ART-3 and Fuzzy ART, while various supervised variants are named with the suffix "MAP", such as ARTMAP, Fuzzy ARTMAP and Gaussian ARTMAP. Unsupervised Niche Clustering (UNC) was proposed by Nasraoui et al. [27]. UNC was designed to overcome the lack of robustness of previous clustering algorithms in the presence of noise, their assumption of a known number of clusters, and the computational limitations that come with an increasing number of data points. In UNC, the clustering problem is converted into a multimodal function optimization problem within the context of genetic niching.

One-Class Support Vector Machines (SVMs) have been applied to anomaly detection in the one-class setting. Such techniques learn a region (a boundary) that contains the training data instances. For each test instance, the basic technique determines whether the instance falls within the learned region: if it does, it is declared normal; otherwise, it is declared anomalous [28].
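As a minimal illustration of the one-class setting, the sketch below uses scikit-learn's OneClassSVM, which is not a tool prescribed by the deliverable; the parameters and data are illustrative and would need tuning for any real deployment.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(2000, 2))    # training data: normal behaviour only

# Learn a boundary around the region containing the training instances.
model = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale")
model.fit(train)

test = np.array([[0.2, -0.1],    # inside the learned region
                 [5.0, 5.0]])    # far outside it
print(model.predict(test))       # +1 = normal, -1 = anomalous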

2.2.2.5 Graph based Methods

Graph-based methods provide a representative tool suite for detecting anomalies. New technologies for anomaly detection in graph data have been developed based on the long-range correlation properties of graphs [29]. Graphs, or other structured data such as sequential data, are used by dedicated machine learning algorithms whose role is to identify anomalies in the graphs [5]. In the case of graph data, the data objects are represented as vertices in a graph and are connected to other vertices by edges; such algorithms can address spatial anomalies as well [2]. Anomaly detection is the branch of data mining concerned with discovering rare occurrences in datasets.


These anomaly detection techniques are based on graph algorithms that can be classified into static vs. dynamic graphs, attributed vs. plain graphs, or, based on the approach, unsupervised vs. (semi-)supervised [29]. These methods are effective, scalable and robust. Two notions are used for anomaly detection: outliers and graphs. With the outlier approach, anomalies are initially sought in unstructured collections of multi-dimensional data points. "An outlier is an observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism" [30]. However, data objects cannot always be treated as points lying independently in a multi-dimensional space, since inter-dependencies might exist. For this reason, the graph approach can be used to effectively capture these long-range correlations among inter-dependent data objects [29]. In addition to the suspicious behaviour detection used in security, anomaly detection techniques are also used to identify rare events [29].

2.2.2.6 Ensemble techniques

While most outlier detection techniques evaluate a single algorithm, outlier ensembles combine multiple anomaly detection algorithms in order to boost their joint anomaly detection performance [5]. Ensemble analysis is a widely used meta-algorithm for many data mining problems such as classification and clustering. Numerous ensemble-based algorithms have been proposed in the literature [31] [32] [33] for these problems, though the approaches differ between supervised and unsupervised problems. Ensemble-based methods are categorized by the approach used; the three main approaches to developing ensembles are (i) bagging, (ii) boosting and (iii) stacked generalization. Bagging (Bootstrap Aggregating) increases classification accuracy by combining the outputs of the learnt classifiers into a single composite prediction. Boosting builds an ensemble incrementally by training each new model on the instances mis-classified by the previous model. Stacked generalization achieves high generalization accuracy by using the output probabilities for every class label from the base-level classifiers. Different categorizations of ensemble techniques have been proposed by Aggarwal [31]. Many ensemble outlier detection techniques have been proposed in the literature. Octopus-IIDS [34] is an example of an ensemble IDS; it is developed using two types of neural networks, Kohonen networks and Support Vector Machines.


Chebrolu et al. [35] present an ensemble approach combining two classifiers, Bayesian networks (BN) and Classification and Regression Trees (CART). Folino et al. [36] introduce a distributed data mining algorithm to improve detection accuracy when classifying malicious or unauthorized network activity, using genetic programming (GP) extended with the ensemble paradigm. Nguyen et al. [37] build an individual classifier using both the input feature space and an additional subset of features given by k-means clustering; the ensemble combination is calculated based on the classification ability of the classifiers on the different local data segments given by k-means clustering. The advantages and disadvantages of ensemble techniques are:

• The advantages of ensemble techniques are: improving the quality and robustness of the solution, selecting the best consensus, avoiding dependency on small fluctuations of the data, and high stability.

• The drawbacks are high computational costs, difficult implementation and increased complexity.

2.2.3 Network Anomaly Detection

Several models to detect and characterize non-conformity patterns have been proposed. Some are based on simple statistics computed on traffic parameters (e.g., the number of UDP packets or the number of SYN packets), so that an anomaly is signalled when values are above a given threshold [38]. Others are based on stronger statistical analysis that uses the density of the signal associated with the traffic to compute its anomaly score and issue signatures for different kinds of anomalies [39]. However, a signature by itself does not provide information about the source of the anomaly, the packets that constitute the anomaly, or other contextual data; such signatures are therefore hardly usable by network and security administrators. Farraposo et al. [40] address the problem of detecting anomalies in traffic traces and characterizing/identifying them by using a two-step algorithm, in which the first step locates/identifies anomalies, while the second step classifies the anomalies using multi-scale and multi-criteria sketch-based features defined by the algorithm. Three different time series are used: the number of packets, the number of bytes and the number of new flows. Most of the previous work focuses on the identification of anomalies based on traffic volume changes. However, since not all anomalies are directly reflected in the number of packets, bytes or flows, it is not possible to identify all of them with such metrics. Approaches that address this issue propose the use of IP packet header data [41]; in particular, IP addresses and ports allow the characterization of detected anomalies.


Anomalies manifest themselves in network statistics in different ways; therefore, developing general models of normal network behaviour and of anomalies is difficult. In addition, model-based algorithms are not portable across applications: a small change in the nature of the network traffic or of the monitored physical phenomena can render the model inappropriate. Non-parametric learning algorithms based on machine learning principles are therefore desirable, as they learn the nature of normal measurements and autonomously adapt to variations in the structure of normality [42]. Despite all the developments in anomaly detection, the most popular procedure to detect non-conformity patterns in network traffic is still manual inspection during the period under analysis (e.g., visual analysis of plots, identification of variations in the number of bytes, packets, flows, etc.). Despite a large literature on traffic characterization, traffic anomalies remain poorly understood, for the following reasons: (i) identifying anomalies requires a sophisticated monitoring infrastructure, and currently most ISPs only collect simple traffic measures, e.g., average traffic volumes (using SNMP); (ii) ISPs do not have tools for processing measurements that can be used to detect anomalies in real time; (iii) network-wide traffic is high-dimensional and noisy, which makes it difficult to extract meaningful information about anomalies from any kind of traffic statistics [43].

Network Flow Anomaly Detection has been widely studied by Brauckhoff [44]. The author uses histogram-based anomaly detectors to pre-filter a set of suspicious flows and applies association rule mining to extract the flows that have caused a malicious event. The model monitors one of the following attributes: transport protocol, source IP address, destination IP address, source port number, destination port number, packets per flow, bytes per flow, inter-arrival times, flow duration, and TCP flags. Labels that identify when an anomaly has happened are required to evaluate whether an anomaly detection system is accurate or not. Research on labelling network traffic has been carried out by Hachem [45], who proposes a multi-protocol label switching scheme and defines virtual classes (e.g., first-level, second-level and third-level suspicious) that reflect a level of suspiciousness using security attributes such as the impact of the diagnosed flow, the type of the attack, and the confidence of the detection.

Entropy-based Anomaly Detection models have been proposed by Wagner and Platter [46], as well as Ranjan et al. [47]. The authors suggest measuring entropy ratios of some network traffic features (e.g., IP addresses and port numbers) in order to detect worm outbreaks and massive scanning activities in near real time; if changes in entropy content are observed, the method raises an alarm.

Histogram-based Anomaly Detection approaches have been proposed by Kind et al. [48], where baselines are constructed from training data for particular network traffic features (e.g., source IP, destination IP, source port number, etc.). An alarm is raised if a deviation from a baseline is observed for any traffic feature during network monitoring.


Brauckhoff et al. [49] have extended the method by combining the histogram-based anomaly detection approach with rule mining; as a result, it is possible to identify the NetFlow records representing anomalous network traffic.

Graph-based Anomaly Detection approaches have been studied by Weigert et al. [50]. With this approach, low-grade security incidents are detected using information across a community of organizations (e.g., the banking industry, the energy generation and distribution industry, governmental organizations in a specific country, etc.). The method maintains graphs of the IP addresses that communicate with members, and is able to alert the community members when suspicious activities are detected.

Clustering-based Anomaly Detection approaches have been proposed by Münz et al. [51], where training data containing unlabelled flow records are separated into clusters of normal and anomalous traffic. The approach applies the k-means clustering algorithm to NetFlow training data to divide the datasets into different clusters. The resulting cluster centroids are then used for fast detection of anomalies in new monitoring data based on simple distance calculations.

Machine learning techniques have been widely proposed in the literature as a viable approach for network anomaly detection. Wagner et al. [52], for instance, propose a machine learning approach for IP-flow record anomaly detection, based on support vector machines (SVM), in order to analyse and classify large volumes of NetFlow records.
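As an illustration of the entropy-based idea discussed above, the sketch below computes the Shannon entropy of one traffic feature (destination IP addresses per time window) and raises an alarm when it deviates from a baseline; the windows, feature choice and threshold are invented for the example.

import math
from collections import Counter

def shannon_entropy(values):
    # Shannon entropy (bits) of the empirical distribution of a traffic feature.
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Destination IPs contacted during two one-minute windows (toy data).
baseline_window = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.1", "10.0.0.2"] * 20
scan_window = [f"10.0.{i}.{j}" for i in range(10) for j in range(10)]   # scanning activity

baseline = shannon_entropy(baseline_window)
current = shannon_entropy(scan_window)

# Illustrative rule: alarm when the entropy deviates strongly from the baseline.
if abs(current - baseline) > 1.0:
    print(f"entropy shifted from {baseline:.2f} to {current:.2f} bits: raise alarm")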

2.2.4 Application-Specific Anomaly Detection

Although massive amounts of application log data are generated by automated systems nowadays, less research is dedicated to application-specific anomaly detection than to network traffic anomaly detection. The general trend observed in the literature is that network anomaly detection is performed without knowing or considering which applications/services are responsible for the traffic. Such an approach is generally motivated by the complexity of dealing with different application environments and of having to perform time-consuming tuning and configuration of the developed frameworks. Some research has focused on building context-aware anomaly detection at the application layer by analysing the application-layer content of network traffic. Patrick Duessel et al. [53] worked on the issue of incorporating protocol context into payload-based anomaly detection. A new data representation called "cn-grams" is used to integrate syntactic and sequential features of payloads and provides a basis for context-aware detection of network intrusions. This approach works for both text-based and binary application-layer protocols and demonstrates superior detection accuracy; a one-class support vector machine was used as the anomaly detection technique. A similar approach was developed by Konrad Rieck [54], leveraging more advanced features and two different anomaly detection algorithms.


Three categories of features have been used: numerical features of application payloads (number of security-related keywords, length of the payload, byte entropy of the payload, etc.), sequential features of application payloads (bag-of-tokens, q-grams, all-subsequences), and syntactical features of payloads using parse trees (bag-of-nodes, selected subtrees and all subtrees). The two anomaly detection techniques used are a one-class Support Vector Machine and k-nearest neighbours. The framework, called SANDY, was empirically evaluated using HTTP and FTP network traffic and over 100 attacks unknown to the applied learning methods, and outperformed several state-of-the-art anomaly detection methods by identifying 80–97% of unknown attacks with less than 0.002% false positives. These studies are a step toward application-specific anomaly detection, since they enhance network anomaly detection with application-aware features; however, neither approach is designed to perform an analysis focused on a specific application.

Some studies have been carried out on application-specific traffic anomaly detection. Hassan Alizadeh et al. proposed application-specific traffic anomaly detection using a Gaussian Mixture-Universal Background Model [55] to build a profile of the genuine traffic of individual applications. The framework used a PCAP dataset with 12 million packets generated by 23 applications of various types. In total, 42 features were used, including the total number of packets/bytes, the minimum/mean/maximum/variance of the packet length (in bytes) and of the inter-arrival time, the flow duration, etc. PCA was then used for feature selection to extract a subset of 11 features. The evaluation results yield an average FNR of 2.75% and an average FPR of 2.83%.

Christopher Kruegel et al. [56] presented an intrusion detection system that uses different anomaly detection techniques to detect attacks against web servers and web-based applications, an application-specific approach that focuses on client queries and parameters. The proposed anomaly detection system takes web server log files as input and produces an anomaly score for each web request. The system takes advantage of the particular structure of HTTP queries, whose parameters are compared against established profiles that are specific to the program. This approach supports a more focused analysis than generic anomaly detection techniques that do not take into account the specific program being invoked. The anomaly detection process uses a number of different models (e.g., frequency models, mean and standard deviation models, Markov and Bayesian probability) to identify anomalous entries within a set of input requests associated with an application. A model is a set of procedures used to evaluate a certain feature of a query attribute or of the query as a whole, and each model is associated with an attribute of a program by means of a profile. The anomaly scores for a query and its attributes are derived from the probability values returned by the corresponding models and aggregated into a single anomaly score using a weighted sum, as sketched below. The models can be applied in two modes: learning and detection.
former is used to set up the normal probability thresholds for queries and queries’ attributes, the latter is used to evaluate the anomaly score of new requests. The results show that the system is able to detect a high percentage of attacks with a very limited number of false positives.

2.3 Anomaly Detection Models

2.3.1 Application Anomaly Detection Framework

The Enhanced Application Monitoring extension we propose is a generic anomaly and misuse detection framework that uses different detection techniques. Unlike most of the network anomaly detection approaches proposed in the literature, our approach uses application logs as input instead of network packets. The framework is designed to be tuned both manually and programmatically to adapt to the applications to monitor. By adapting the techniques used for each application to monitor, we believe the framework will perform better, since more context and information can be harnessed when focusing on a specific application. The framework is based on outlier ensemble techniques. The motivations behind this decision are:

• Anomaly Detection Models: they are usually constructed using a subjective and heuristic process based on an analyst’s understanding.

• Assumptions and Hypotheses: those made in the modelling phase can often be imperfect.

• Models: these models may work better on some parts of the data than others.

The ensemble analysis approach is used in order to reduce the dependence of the model on the specific data set or data locality, and to increase the robustness and performance of the anomaly detection process. The underlying idea is simple: “Combine the results from different models to create a more robust model.” As described in the Architecture and Use Cases section, the framework uses a hybrid anomaly/misuse detection engine composed of a Rule Engine and a Behavioural Engine. In the rest of this section, we describe the characteristics and the underlying techniques of each engine.

2.3.1.1 Rule Engine

The Rule Engine is used to detect application misuse and malicious activities. It is suitable for integrating the expert knowledge of the application monitored, and is particularly useful for harnessing community rules and known attacks. Input from the OSINT extension will also be used by the Rule Engine.

Two types of rules are supported, crisp/static rules and fuzzy rules, as follows. Crisp/Static Rules: in contrast to fuzzy rules, these rules are based on Boolean logic and yield a binary response to the input presented. Apache Drools will be used to implement and evaluate the crisp rules. Drools is a Business Rules Management System (BRMS) solution, whose core components are: a core Business Rule Engine (BRE), a rule management application (Drools Workbench) and an Eclipse IDE plugin for core development. A common scenario when crisp rules are used is to set thresholds on application session metrics (number of requests per session, session duration, requests per second, etc.), or to flag a user performing actions outside of their granted privilege sphere. The extension will be deployed as an application intrusion detection system for an e-commerce application. For this deployment, the OWASP Top 10 [57] core rule set from ModSecurity [58] will be integrated into the rule engine. Custom rules can also be added. Fuzzy Rules: fuzzy rules are a tool for expressing pieces of knowledge in fuzzy logic. Fuzzy logic is an approach to computing based on "degrees of truth" rather than the usual "true or false" (1 or 0) Boolean logic on which the modern computer is based. However, there does not exist a unique kind of fuzzy rules, nor is there only one type of "fuzzy logic", as explained in [59]. The advantages of using fuzzy rules and fuzzy logic are:

• Inherently robust, since it does not require precise, noise-free inputs
• Rules can be modified and tweaked easily to improve system performance
• Offers more flexibility to integrate expert knowledge with different certainty levels into the rules

Figure 1 : A Membership Function example

The rules are based on the concept of membership functions. A membership function is a graphical representation of the magnitude of participation of each input. It associates a weighting with each of the inputs that are processed, defines the functional overlap between inputs, and ultimately determines an output response. The rules use the input membership values as weighting factors to determine their influence on the fuzzy output sets of the final output conclusion.
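
As an illustration of how a membership function could be evaluated in code, the minimal sketch below implements a triangular membership function and uses its degree of membership to weight a fuzzy rule; the feature, breakpoints and rule certainty are hypothetical placeholders, not rules that will actually be deployed.

```python
def triangular_membership(x, a, b, c):
    """Degree of membership of x in a triangular fuzzy set with
    feet at a and c and a peak at b (returns a value in [0, 1])."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical fuzzy set "suspiciously high request rate" for an application
# session; the breakpoints below are placeholders, not tuned thresholds.
requests_per_second = 42.0
membership = triangular_membership(requests_per_second, a=20.0, b=60.0, c=100.0)

# A fuzzy rule uses the membership degree as a weighting factor for its
# contribution to the fuzzy output, instead of the hard true/false verdict
# produced by a crisp rule.
rule_certainty = 0.8  # expert-assigned certainty of this rule
contribution = membership * rule_certainty
print(f"membership={membership:.2f}, rule contribution={contribution:.2f}")
```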

2.3.1.2 Behavioural Engine

The Behavioural Engine will primarily be based on the Skeptic Framework [60]. However, depending on the application monitored, other detection approaches can be used. In the remainder of this section, we provide a detailed description of the Skeptic Framework, and we list the different detection models that the extension will implement. Skeptic is a framework developed in order to detect functional misuse, abuse attempts and other anomalies in Amadeus applications. The framework is based on a semi-supervised statistical learning approach which is meant to automatically model users’ behaviour and highlight significant anomalies (or significant deviations) from the previously learnt user model. Anomalies identified on different user features (e.g., strange connection time, unusual combination of user profile and user machine, etc.) are later combined in a fuzzy way using an aggregation method inspired by MCDA (Multi-Criteria Decision Analysis). The combined anomalies are then fed into an intervention module whose scope is to prioritize the anomalies but also to trigger actions based on the severity level (e.g., ask for a second authentication, terminate the session, etc.). It is worth mentioning that the user behaviour models are automatically created by the behaviour learning engine based on historical data (i.e., application logs) and statistical analysis. The overall Skeptic architecture can be seen in Figure 2. The upper layer, referred to as Behavioural Modelling, is designed to let the system continuously learn users’ behaviour in an unsupervised manner by using historical data and statistical analysis. Users’ sessions are monitored in order to extract certain features. After this, machine learning algorithms are used to build UBMs (User Behaviour Models), which are leveraged afterwards to evaluate new incoming users’ actions in order to detect anomalies (or significant deviations from the pattern) via the Scoring Layer. An important point is the adoption of user-centric modelling (since different types of users may exhibit different types of behaviour, and hence not the same type of anomalies). The learning process is auto-adaptive: UBMs are updated dynamically w.r.t. incoming data (users’ behaviour can change over time). The Anomaly Scoring Engine is located in the lower layer and is meant to leverage the behavioural models (i.e., UBMs) previously created for every feature extracted from the application session. Its role is also to associate a specific
anomaly scoring function to every model. The scope is to have rare or infrequent values receive higher anomaly scores, while frequently observed values are associated with low anomaly scores. One of the key points is to evaluate new users’ sessions against the previously learnt models. A global anomaly score is calculated by aggregating all individual anomaly scores using MCDA techniques (i.e., multi-criteria decision analysis).

Figure 2 : Skeptic Architecture
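
To illustrate the kind of aggregation performed by the Scoring Layer, the sketch below combines per-feature anomaly scores with a simple weighted sum; the feature names, scores and weights are hypothetical, and the actual Skeptic aggregation is MCDA-based and more elaborate than this.

```python
# Per-feature anomaly scores in [0, 1] produced by the corresponding user
# behaviour models (feature names and values are illustrative only).
feature_scores = {
    "connection_time": 0.10,
    "requests_per_session": 0.85,
    "profile_machine_combination": 0.40,
}

# Relative importance of each feature, e.g. elicited from SOC experts;
# a stand-in for the MCDA-derived weights.
weights = {
    "connection_time": 0.2,
    "requests_per_session": 0.5,
    "profile_machine_combination": 0.3,
}

def aggregate(scores, weights):
    """Weighted-sum aggregation of per-feature anomaly scores."""
    total_weight = sum(weights.values())
    return sum(scores[f] * weights[f] for f in scores) / total_weight

global_score = aggregate(feature_scores, weights)
print(f"global session anomaly score: {global_score:.2f}")

# The intervention module could then map the score to an action,
# e.g. requesting a second authentication above some threshold.
if global_score > 0.6:
    print("trigger second-factor authentication")
```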

The models that may be used by the Skeptic Framework are:

• Histogram-based models: A non-parametric statistical model, also referred to as frequency-based or counting-based. One of the advantages of this model is that it makes fewer assumptions regarding the data. Popular in the intrusion detection community, this model is simple yet effective, since application data is usually governed by certain user profiles. It is particularly suitable for discrete session features such as the number of requests per session, the IP address used, the number of errors generated, etc. (a minimal sketch of such a model is given at the end of this subsection).

• Gaussian Mixture Models: A parametric statistical model that assumes the data is generated from a mixture of Gaussian distributions. The Gaussian parameters are estimated using Maximum Likelihood Estimation (MLE). This model is suitable for time-based session features such as the connection time, the session duration, etc.

• Regression Models: A model suitable for time-series data. After the model is fit to the application session data, the residual for the new data instances is used to determine the anomaly score. The magnitude of the residual reflects the degree of abnormality of a session.

• Markov Chain Mixture Models: A simple and rich tool to model the sequences of actions performed by a user during a session. This model can fit different user profiles and predict the likelihood of a sequence of actions during an application session. Two possible approaches have been used in the literature: using the Expectation-Maximization algorithm, or using a Hidden Markov Model for the underlying sequence structure.

• Bag of words model: A popular model in the information retrieval and text processing domains. This model is adapted to application payloads and query parameters. Bag of words is an algorithm that counts how many times a word appears in a document. Word counts allow comparing documents and gauging their similarities. In the context of monitoring user activities on an application, a document can be the query sent by a user.

Note that the list above is not exhaustive. As an ensemble anomaly detection framework, Skeptic is designed to allow plugging in and configuring different detection models. Depending on the application to monitor, other unsupervised and semi-supervised techniques can be more suitable to detect anomalous behaviours than the Skeptic Framework. In addition to Skeptic, below is a non-exhaustive list of anomaly detection techniques that may be combined and used as the Behavioural Engine of the Enhanced Application Monitoring extension:

• Self-Organising Map
• Adaptive Resonance Theory
• Fuzzy C/K means
• One-Class Support Vector Machines
• Principal Component Analysis
• KNN

Finally, the extension will allow defining the models to use for the Behavioural Engine, and configuring model parameters and thresholds.
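
As an illustration of the per-feature models listed above (referenced in the histogram item), the following sketch builds a simple frequency-based model of a discrete session feature and derives an anomaly score from how rarely a value has been observed; it is a simplified stand-in, not the Skeptic implementation.

```python
from collections import Counter

# Historical values of a discrete session feature for one user, e.g. the
# number of requests per session (values are illustrative only).
history = [12, 15, 14, 12, 13, 15, 14, 12, 11, 13, 14, 12, 13]

counts = Counter(history)
total = len(history)

def histogram_anomaly_score(value, counts, total, smoothing=1):
    """Anomaly score in (0, 1]: rare or unseen values score close to 1,
    frequently observed values score lower (add-one smoothing handles
    values never seen during learning)."""
    frequency = (counts.get(value, 0) + smoothing) / (total + smoothing)
    return 1.0 - frequency

print(histogram_anomaly_score(12, counts, total))   # frequent value -> lower score
print(histogram_anomaly_score(500, counts, total))  # unseen value -> score close to 1
```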

2.3.2 Network-based Anomaly Detection

The proposed network anomaly detector uses one-class support vector machines (One-Class SVM), as described in [61]. One-Class SVM was introduced by Schölkopf et al. [62] as an unsupervised algorithm that learns a decision function for novelty detection (i.e., classifying new data as similar or different to the training set). The choice of this method relies on the fact that this is a type of unsupervised
learning process with no class labels: the method takes as input an array X and detects the soft boundary of that set so as to classify new instances as belonging or not to that set. Two important aspects must be considered while using this method: (i) novelty detection, when the training dataset does not contain outliers; and (ii) outlier detection, when it does. Novelty detection consists of learning a rough, close frontier delimiting the contour of the distribution of the initial observations, plotted in the embedding space. If further observations lie within the frontier-delimited subspace, they are considered as coming from the same population as the initial observations. Otherwise, if they lie outside the frontier, we can say that they are abnormal, with a given confidence in our assessment. Figure 3 depicts an example of the novelty detection method, with a training error of 19/200, an error on novel regular observations of 3/40, and an error on novel abnormal observations of 0/40.

Figure 3 : Novelty Detection Method [63]
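
A minimal sketch of how such a novelty detector could be trained on normal observations and applied to new ones is shown below; the feature values are synthetic placeholders, and parameters such as nu and gamma would need tuning for real traffic.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)

# Synthetic "normal" training observations, e.g. two flow features such as
# packet count and mean inter-arrival time (placeholders, not real traffic).
X_train = 0.3 * rng.randn(200, 2) + np.array([10.0, 0.5])

# New observations: two similar to the training data, one far outside it.
X_new = np.array([[10.1, 0.6],
                  [9.8, 0.4],
                  [25.0, 5.0]])

# nu bounds the fraction of training errors/support vectors and gamma
# controls the width of the RBF kernel; both would need tuning.
detector = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5)
detector.fit(X_train)

# predict() returns +1 for observations inside the learnt frontier and
# -1 for observations considered novel/anomalous.
print(detector.predict(X_new))            # e.g. [ 1  1 -1]
print(detector.decision_function(X_new))  # signed distance to the frontier
```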

Outlier detection aims at separating a core of regular observations from the polluting ones, called outliers. In this case, there is no clean data set representing the population of regular observations that can be used to train the tool. Two techniques are widely used in outlier detection: fitting an elliptic envelope, and isolation forest. The former assumes that normal data come from a known distribution (e.g., a Gaussian distribution), and a boundary is defined to separate normal instances (observations within the boundary) from outliers (observations far enough from the boundary). The latter isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that feature; as a result, when a forest of random trees collectively produces shorter path lengths for particular samples, those samples are highly likely to be anomalies.

Figure 4(a) shows the calculation of the Mahalanobis distance from robust and non-robust estimates. If the inlier data are Gaussian distributed, then the inlier location and covariance will be estimated in a robust way (i.e. without being influenced by outliers). The Mahalanobis distances obtained from this estimate are used to derive a measure of outlyingness [65]. Figure 4(b) shows an example of Isolation Forest. Since recursive partitioning can be represented by a tree structure, the number of
splits required to isolate a sample is equivalent to the path length from the root node to the terminating node. This path length, averaged over a forest of such random trees, is a measure of normality and is used as a decision function [66].

(a) Fitting an elliptic envelope (b) Isolation Forest

Figure 4 : Outlier Detection Method [64]
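
The sketch below applies the two outlier detection techniques to the same synthetic data; values and parameters are placeholders chosen only for illustration.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Mostly regular (Gaussian) observations polluted by a few outliers.
X_regular = 0.5 * rng.randn(200, 2)
X_outliers = rng.uniform(low=-6, high=6, size=(10, 2))
X = np.vstack([X_regular, X_outliers])

contamination = 10 / 210  # expected proportion of outliers in the data

# (a) Elliptic envelope: assumes an underlying Gaussian distribution and
# flags observations far from the robustly estimated centre.
envelope = EllipticEnvelope(contamination=contamination, random_state=0)
labels_envelope = envelope.fit_predict(X)  # +1 inlier, -1 outlier

# (b) Isolation Forest: isolates observations with random splits; samples
# with short average path lengths are likely anomalies.
forest = IsolationForest(contamination=contamination, random_state=0)
labels_forest = forest.fit_predict(X)      # +1 inlier, -1 outlier

print("elliptic envelope outliers:", int((labels_envelope == -1).sum()))
print("isolation forest outliers:", int((labels_forest == -1).sum()))
```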

2.4 Architecture

2.4.1 Application-based anomaly detection

The Enhanced Application Monitoring extension aims at improving application security by leveraging a user-centric anomaly and misuse detection framework. The framework core is composed of a Behavioral Engine and a Rule Engine, combined to improve the detection performance and the overall robustness of the framework. The framework architecture is modular (functionalities are separated into independent modules), extensible (potential improvements can easily be applied by adding new modules or functionalities) and reusable (although the framework is designed to be application-specific, it can be deployed to monitor very different applications and application environments). Using application logs as input, the system will leverage both supervised and unsupervised learning techniques to build user behavior models, and expert knowledge rules, to associate an anomaly score to user sessions. The anomaly score will then allow taking the appropriate action. The architecture of the extension consists of several modules, as described in the following paragraphs and illustrated in Figure 5. Configuration Module, which is a critical component for ensuring the flexibility of the extension; it is responsible for generating a configuration object from a single configuration file. The configuration object will be used to define
the behavior of the extension and its interaction with the SIEM used and with the other DiSIEM extensions developed by the partners. Concretely, the configuration module will be used to define the application log inputs, the input normalization and correlation scheme, the behavioral models and their parameters, the set of rules to be used by the rule engine, the scoring aggregation scheme, and the output result destination. Input Module, which is primarily used to fetch the input data that will be consumed by the framework. The input module will also be used to digest relevant OSINT data to improve the detection performance and to integrate feedback from the visualization extension to tune the system parameters. Input Aggregation Module, which is particularly useful in case different log formats are used as input. It is responsible for normalizing the different logs and aggregating the log records into log sessions. Anomaly/Misuse Detection Engine, which is the core component of the extension. It is composed of two engines:

Behavioral Engine: this module is used to build detection models, to train the models and to predict an anomaly score for new application sessions. The Skeptic Framework (developed by Amadeus in order to detect functional misuse, abuse attempts and other anomalies for Amadeus applications [60]) will be the basis of the behavioral engine; however, depending on the monitored application, different detection techniques can be used. Although Skeptic is designed to support different detection techniques, it uses a model-per-feature approach for the behavioral modelling. Rule Engine: equally important as the Behavioral Engine, it will allow integrating expert knowledge and open source intelligence into the detection mechanism. Therefore, the Rule Engine will allow an early detection of known attacks and functional misuse scenarios with less computation. The Rule Engine will support both crisp and fuzzy rules, depending on the certainty level of the rule to evaluate. Crisp rules are based on Boolean logic, and fuzzy rules will be based on fuzzy logic [67].

Output Module, which is used to forward the selected results of the anomaly detection mechanism to the right destination. Session anomaly score(s) and rule evaluation results are sent to the SIEM. The Output module can also be configured to send corrective actions to the monitored application and to other extensions if needed.
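
To give an idea of the information the configuration object could carry, the sketch below shows a hypothetical configuration; all keys, paths and values are placeholders, and the real configuration format will be defined during implementation.

```python
# Hypothetical configuration object for the Enhanced Application Monitoring
# extension; every key, path and value below is a placeholder.
config = {
    "inputs": {
        "application_logs": ["/var/log/app/access.log"],  # placeholder path
        "osint_feed": "https://example.org/osint-feed",    # placeholder URL
    },
    "aggregation": {
        "session_key": ["user_id", "source_ip"],
        "session_timeout_minutes": 30,
    },
    "behavioural_engine": {
        "models": {
            "requests_per_session": {"type": "histogram"},
            "connection_time": {"type": "gaussian_mixture", "components": 3},
        },
        "score_aggregation": "weighted_sum",
    },
    "rule_engine": {
        "crisp_rules": ["modsecurity_crs"],    # community rule sets to load
        "fuzzy_rules": ["high_request_rate"],  # locally defined fuzzy rules
    },
    "output": {
        "siem_endpoint": "syslog://siem.example.org:514",  # placeholder
        "alert_threshold": 0.7,
    },
}
```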

Figure 5 : Enhanced Application Monitoring extension architecture.

2.4.2 Network-based Behavior Anomaly Detection

This section describes the preliminary architecture of the network-based anomaly detection extension that will be developed by Atos in DiSIEM, as well as the proposed use cases. Taking as a starting point the analysis of the state of the art for the detection of anomalies (presented in Section 2.2), the normal behavior of the network traffic related to a specific application is modelled using an unsupervised learning process, so that every event that falls outside the boundaries will be considered anomalous. With this purpose, this detector will analyse the network traffic. It will also consider information related to the usage of the monitored application by the users, which is present in the application log files.

Figure 6 presents the architecture of the component. The figure shows the following elements:

• Traffic Capture: it contains the dataset instances to be analysed during the training and the prediction process. Consequently, it can work in two different modes: regular traffic capture mode for training, and monitoring mode for prediction. Traffic captured in this architecture is assumed to be legitimate during the training period. It will be collected from the managed infrastructure during the controlled execution of the monitored application. In order to classify traffic and perform the prediction in the monitoring mode, the Traffic Capture element needs to read traffic in real time (legitimate or suspicious) from the monitored infrastructure, to be processed by the sensor. To be able to associate the network traffic generated or received by the specific monitored application with the different actions taken by the users, relevant application logs (depending on the monitored application) will also be monitored by the sensor. Additional information collected from these logs will be provided in the dataset instances, together with the network traffic recovered, for the detection.

• Entry Point: it functions as a web service where applications (traffic coming from the monitored applications) are identified by their IP addresses and port numbers (e.g., in case the same host runs several applications). The entry point requests the IP-port pair or group of IPs and ports from the database in order to make the predictions.

• Training and prediction: this process uses machine learning to make predictions over the captured data. The training uses a predefined time window that can vary according to the size of the data to be analyzed (e.g., a 5-second window). During the training process, the dataset is pipelined with a Principal Component Analysis (PCA) [68], which is in charge of reducing the dimensionality of the dataset while keeping as much of the variance as possible.

• Database: it is used to store the traffic capture and the output of the training and prediction processes.

• Prediction Service: the main objective of this service is to label data as normal or anomalous based on the results of the training and prediction process. This service provides the output of the network-based anomaly detector that will be sent to the visualization module and to the SIEMs. Periodically, the component will generate a JSON event with an indication of whether the analyzed traffic is normal or anomalous with respect to the learnt behavior models, as well as a summary (a sketch of such an event is given after this list). These JSON events will be sent using Syslog to be integrated by the SIEMs. They will also be stored locally in a database that can be accessed by the SIEMs or other DiSIEM visualization components.

• Configuration: it allows interaction between the application and the end user.
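
The sketch below shows how such a JSON event, mentioned under the Prediction Service element above, could be built and emitted over Syslog; the field names, values and Syslog endpoint are hypothetical and do not prescribe the final event schema.

```python
import json
import logging
import logging.handlers
from datetime import datetime, timezone

# Hypothetical prediction result for one analysis window.
event = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "application": "webshop",            # placeholder application name
    "window_seconds": 5,
    "verdict": "anomalous",              # "normal" or "anomalous"
    "summary": {"flows": 1310, "flagged_flows": 27},
}

# Send the JSON event to the SIEM via Syslog (the address is a placeholder).
logger = logging.getLogger("network-anomaly-detector")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.SysLogHandler(address=("siem.example.org", 514)))
logger.info(json.dumps(event))
```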

Figure 6 : Network-based Anomaly Detector Preliminary Architecture

2.5 Testing and Validation

2.5.1 Application-based Anomaly Detection

The proposed application anomaly detection framework will be developed and tested on specific applications used by Amadeus as a first phase. The extension needs to be tested on different applications to ensure its reusability for different use cases. Three potential applications may be used: an e-commerce web application, Active Directory Domain Services, and jump servers. The testing and validation will require security expert knowledge for the different application domains. Then, as part of the “Technology Validation and Pilot
Deployments” Work Package (WP7), the extension will be deployed and validated in operational environments.

2.5.2 Network-based Anomaly Detection

The proposed algorithms need to be tested against a valid dataset containing legitimate and malicious traffic, preferably from the same data domains that will be considered in WP7 (Technology Validation and Pilot Deployments) of the DiSIEM project. Ideally, the dataset should contain TCP packets with specific boundary intervals (e.g., start and end points), a set of features (e.g., duration, protocol type, service, etc.), as well as labels for normal and anomalous traffic. The testing and validation process should consider the following aspects (a minimal sketch of such a pipeline and its evaluation is given after the list):

• Testing is carried out on a representative subset of the data;
• Models are trained using only normal traffic samples;
• Each attack is treated as a single anomaly class;
• The training dataset is pipelined through a principal component analysis (PCA) module;
• The dataset dimensionality is reduced while keeping as much variance of the original dataset as possible;
• Trained models are compared and validated against the dataset;
• Conclusions on the performance of each model are obtained after the validation process.
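
The following sketch, assuming synthetic data in place of the WP7 datasets, illustrates such a testing pipeline: a PCA step that retains most of the variance, a One-Class SVM trained only on normal samples, and a simple FPR/FNR evaluation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(1)

# Synthetic stand-in for labelled traffic features: normal samples and one
# attack class (real experiments would use the WP7 datasets instead).
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 20))
X_attack = rng.normal(loc=4.0, scale=1.0, size=(50, 20))

# Train only on normal traffic, as stated in the list above.
X_train, X_test_normal = X_normal[:400], X_normal[400:]

# PCA reduces dimensionality while keeping 95% of the original variance.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    OneClassSVM(kernel="rbf", nu=0.05, gamma="scale"),
)
model.fit(X_train)

# predict() returns +1 for normal and -1 for anomalous samples.
pred_normal = model.predict(X_test_normal)
pred_attack = model.predict(X_attack)

fpr = float((pred_normal == -1).mean())  # normal traffic flagged as anomalous
fnr = float((pred_attack == 1).mean())   # attack traffic that was missed
print(f"FPR={fpr:.3f}  FNR={fnr:.3f}")
```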

3 Diverse monitoring of critical assets

3.1 Motivation

SIEM event correlation is an essential part of any SIEM solution. It aggregates and analyses log data from across network applications, systems, and devices, making it possible to discover security threats and malicious patterns of behaviour. SIEM correlation rules leverage the SIEM correlation capability to detect threats and raise alerts. However, the reliability of the alerts raised depends on the monitoring infrastructure setup and the part of the network in question. For example, a SIEM could accept data from several IDSs and combine these, though no SIEM provides a way to assess configurations, for example the ideal number of IDSs to run on one part of a network or the best adjudication rules. Furthermore, to the best of our knowledge, no current SIEM solution offers the capability to predict the time to the next false positive and to the next false negative alert. To address these limitations, a SIEM extension for diverse monitoring of critical assets will be developed. This will involve the setup, commissioning and deployment of an infrastructure with diverse security tools such as antiviruses, firewalls, intrusion detection systems, vulnerability scanners, etc. This extension will allow assessing the uncertainty of the alerts raised and estimating the performance of combining monitoring system results.

3.2 State of the Art

Currently, SIEMs provide a correlation facility to aggregate and fuse data from disparate monitoring systems. Specifically, event correlation applies “logical correlations among disparate individual raw log events” [69] to report potential security problems as alerts or actionable alarms. The logs and alerts from heterogeneous tools, including IDSs, firewalls, and system monitoring tools like Nagios, are written in a variety of formats, from binary to human-readable, with different fields, from source IP address to CPU usage. Some SIEMs normalise this data first, putting it into a relational database and picking out common fields. SIEM log correlation attempts to fuse this disparate data, either directly or from normalised data, using configurable rules, to make sense of what is happening in the monitored system. The data fusion, or event correlation, as it is referred to in the SIEM literature, often takes the form of if [and/or]/then statements. These can be formed into several steps, spotting the stages of an attack over time, or can simply accept an alert from one IDS on one source IP address. Similar IP addresses could also be considered. The rules can be given a weighting, allowing the alerts they may produce to be quantified. Other systems “learn” a baseline and spot divergence from normal usage patterns; for example, QRadar can spot unusual usage or off-hours activity [70]. In terms of diversity assessment, SIEM correlation rules are aimed at the fusion of disparate monitoring tools. Their power is seen as coming from the integration of different event sources to provide security information in a single place about
potential attacks, fraudulent behaviour and so on, in conjunction with the correlation rules which raise alarms. Simpler log manager systems are available, such as Loggly [71] or the Elastic Stack, but they do not provide the correlation and alerting. Furthermore, SIEMs do not offer predictions and forecasting: they focus on real-time or near real-time analysis of events. For example, the Infosec Institute gives six use cases for Splunk’s search processing logic [72] for the “correlation” of events. SIEMs can also be used for forensics, looking back over what happened, provided the data is still available; SIEMs only store data for a limited time. The log centralisation performed by SIEMs makes this task relatively straightforward; otherwise an operator would have to manually retrieve logs from each machine involved. Furthermore, many SIEMs now incorporate some form of advanced analytics, such as user and entity behaviour analytics (UEBA), which this project will develop further. This steps beyond logic-based rules, finding anomalies in data, but does not offer predictions. Modelling functionality is provided by a few software packages. Vulnerabilities, exploits and patches data are freely available. VepRisk analyses and visualises this security data, using reliability growth models (RGMs), assessing predictive accuracy and addressing recalibration. Further details are available in [73]. This standalone tool applies RGMs to openly available security data. In contrast, the diversity analysis and forecasting tool detailed in this section will apply the models to SIEM data. Other standalone tools have been written, including the Computer Aided Software Reliability Estimation tool (CASRE) [74]. This desktop tool provides many RGMs via a graphical user interface; older tools tend to provide a command-line interface and send the results as a table to a file. In order to access the results in a SIEM, the underlying RGMs need to be driven via a script or service with an easy-to-use API, such as REST, that provides flexible ways to import data for the models and export the results in suitable formats for display. None of the existing tools provide this ability. Other modelling approaches include Mobius [75], which models the behaviour of complex systems, including the reliability, availability, and performance of computer and network systems. Mobius is used for building and analyzing probabilistic models of systems with the “Markov” property, that the future is independent of the past given the current state of the model. It is a general tool for supporting assessment and prediction in engineering systems, allowing parameters to be investigated and different system configurations explored. It has previously been used for security modelling, for example in [76] and [77]. The tool has been used for a variety of vulnerability assessments [78].

3.3 Models

Eight models based on the software reliability growth (SRG) state of the art will be incorporated. Deliverable 3.2 provides further details of the modelling itself. Research is ongoing to assess how to adapt these models to diversity monitoring and assessment for security, including the suitable granularity of the inputs. They will provide
a prediction of the time to the next false positive or other event used as input. They allow recalibration, giving more accurate predictions.
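
As a much simpler stand-in for these models, the sketch below predicts the expected time to the next false positive under a constant-rate (homogeneous Poisson) assumption; the actual RGMs allow the rate to change over time and support recalibration.

```python
# Inter-arrival times (in hours) of past false positives for one monitoring
# configuration (illustrative values only).
inter_arrival_hours = [5.0, 9.5, 3.0, 7.5, 6.0, 8.0, 4.5]

# Under a constant-rate (homogeneous Poisson) assumption, the maximum
# likelihood estimate of the rate is n / total observed time, and the
# expected time to the next false positive is its reciprocal.
rate = len(inter_arrival_hours) / sum(inter_arrival_hours)
expected_time_to_next = 1.0 / rate

print(f"estimated rate: {rate:.3f} false positives per hour")
print(f"expected time to next false positive: {expected_time_to_next:.1f} hours")
```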

3.4 Architecture and Implementation

This section contains the description of the architecture of the diversity monitoring and forecasting extension. Figure 7 shows its architecture, incorporating three main features. It counts alerts through the “alert filter” feature, aggregating and adjudicating between diverse systems. Once the alerts and non-alerts have been assessed and labelled as true/false positive/negative data points, analyses can be made of specificity, accuracy and so on. Finally, these results can be used to make forecasts and predictions. The predictions can either be sent to a results file, or fed back via syslog or REST into the SIEM for display on the front end. Further details are given in the rest of this section.

Figure 7: Architecture of the Diversity and Forecasting tool

Diverse monitoring tools, including many IDSs, anti-virus tools and firewalls watching the same traffic, send their alerts and events to the existing SIEM. The alert filtering counts how many tools agree on the problematic traffic. In order to spot agreement and return counts of n out of n, 1 out of n, k out of n, etc., the database schema must allow joins on common fields, including timestamp, protocol and similar. This adjudication can be implemented as a widget or filter on the SIEM front-end, a SIEM correlation rule, or a database query or other script. The specifics of data collection and aggregation or adjudication will vary between SIEMs; a minimal sketch of a k-out-of-n adjudication step is given below.
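
The sketch below shows such a k-out-of-n adjudication implemented as a script over normalised alert records; the tools, fields and time window are placeholders, and in practice the same logic could be expressed as a correlation rule or database query.

```python
from collections import defaultdict

# Alerts from diverse monitoring tools, already normalised on common fields
# (tool name, epoch timestamp, source IP); the values are illustrative.
alerts = [
    {"tool": "ids_a", "timestamp": 1500000010, "src_ip": "198.51.100.7"},
    {"tool": "ids_b", "timestamp": 1500000012, "src_ip": "198.51.100.7"},
    {"tool": "antivirus", "timestamp": 1500000013, "src_ip": "198.51.100.7"},
    {"tool": "ids_a", "timestamp": 1500000050, "src_ip": "203.0.113.9"},
]

def adjudicate(alerts, k, window_seconds=30):
    """Return the (time bucket, source IP) pairs on which at least k distinct
    tools agree: a simple k-out-of-n adjudication over common fields."""
    agreeing_tools = defaultdict(set)
    for alert in alerts:
        bucket = alert["timestamp"] // window_seconds
        agreeing_tools[(bucket, alert["src_ip"])].add(alert["tool"])
    return {key: tools for key, tools in agreeing_tools.items() if len(tools) >= k}

# 2-out-of-n adjudication: only traffic flagged by at least two tools remains.
print(adjudicate(alerts, k=2))
```
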
The traffic labelling implementation will also vary between SIEMs. In the case of XL-SIEM, a field can be added to the database indicating the actual status of the events: true alert or false. The front end will then allow manual editing of this field. This allows events from diverse monitoring tools to be labelled. If traffic was problematic but did not cause an event in any monitoring tool, some way to add this to the system must be investigated. For other SIEMs, a stand-alone application will query the data store, adding supplemental information indicating the actual status of alerts from monitoring tools (true or false positive), and information regarding packets which did not cause alerts (true or false negative). The new information can either be stored directly in the SIEM database or in supplemental storage. This can take the form of a Python script reading pcap files and displaying their content for assessment by SOC operatives. It will allow all records matching chosen criteria to be labelled as actual alerts, to avoid manual labelling of pcaps one line at a time. Alternatively, an IDS or other monitoring tool with updated rules can be run over old pcap files to indicate if any malicious traffic is now spotted that had previously been missed. This will provide an approximation to false negatives, without the requirement for manual labelling. Furthermore, if the SOC operatives become aware of groups of traffic data being problematic, for example with a specific IP address, this can be encoded as an IDS rule, and again the IDS can be re-run over old pcap data to count how many alerts failed to be raised previously.

Finally, the forecasting and prediction feature uses the collected alert counts and labelled data. It can report counts and metrics such as accuracy, together with data on the times of false positives and negatives. When graphed, this will provide a visual clue to normal behaviour, allowing abnormal conditions to be spotted quickly. It will then use known reliability growth models to generate predictions of the time to the next false positive and to the next false negative. This will be implemented as a Python script calling a FORTRAN executable. This can be run manually, invoked via a task scheduler, or run as a service, to serve the data to the SIEM either through a log file or directly over HTTP (via a REST API). The Python script will collect the data and write it to a file in the format expected by the FORTRAN code. It will then collect the results and transform them into a suitable format for the SIEM or visualization component.
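
A minimal sketch of the wrapper script is given below; the executable name, arguments and file formats are placeholders for those expected by the existing FORTRAN code.

```python
import subprocess

# Placeholder names: the actual executable, arguments and file formats are
# defined by the existing FORTRAN reliability growth model code.
FORTRAN_EXE = "./rgm_models"
INPUT_FILE = "events.dat"
OUTPUT_FILE = "predictions.dat"

def run_models(inter_event_times):
    """Write the inter-event times to the input file, invoke the FORTRAN
    executable, and return its raw output lines for later reformatting."""
    with open(INPUT_FILE, "w") as f:
        for t in inter_event_times:
            f.write(f"{t:.3f}\n")

    # Invoke the FORTRAN executable (command-line arguments are placeholders).
    subprocess.run([FORTRAN_EXE, INPUT_FILE, OUTPUT_FILE], check=True)

    with open(OUTPUT_FILE) as f:
        return [line.strip() for line in f if line.strip()]

# The returned predictions would then be reformatted (e.g. as JSON or syslog
# messages) for the SIEM or the visualization component.
```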

3.5 Testing and Validation

The data collection is already happening in a SIEM. In order to validate the “alert filtering”, the joins or correlation of the data via the adjudication algorithm can be tested against artificial data ensuring coverage of edge cases. In a test system, the results can be observed, ensuring that certain properties hold, for example that 1 out of n will generate more alerts than n out of n. This will provide a sanity check. For the data labelling, the algorithm can be tested against small sets of data covering edge cases to ensure the numbers of true positives etc. are correct. Once data has been collected and labelled, the package will allow the user to run the models directly, via a command-line Python script. The models have been implemented in FORTRAN and thoroughly tested, so a diff between the output of calling the FORTRAN code directly and calling it via the script will validate the floating-point output. The other scenarios and visualizations can be tested against historical data and visually assessed.

4 Cloud-backed Long-term Event Archive

4.1 Motivation

Nowadays, collecting and storing the events produced by a multitude of devices (firewall, IDS, IPS, anti-virus, syslog, etc.) and concentrating all this data in a single place is crucial for many organizations. This central point of management is represented by the SIEM, which is used to collect, aggregate and process events.

Events register all kinds of actions happening in an organization’s infrastructure, such as network flows, operating system information, and security incidents. One of the most important benefits of collecting and keeping this data is the possibility to access it in retrospect, to understand and investigate why and how some security incident happened.

The most recent SIEMs archive their events for a short period of time, between a week and a year [79] [80], but usually for 6 months. During this period, there are storage constraints which limit the necessary support to forensic activities in situations where incidents could only be explained by already-deleted events. For instance, some studies show that certain advanced security threats exploiting zero-day vulnerabilities take on average 320 days until being discovered [81], and there are cases in which a certain threat was exploiting systems for as long as nine years [82]. Furthermore, many regions or countries have decreed data privacy laws that require determining the exact date when a breach of sensitive data occurred in order to notify the affected parties in a timely manner. Considering this, some standardization institutions have recommended one or more years of data retention. For instance, COBIT (Control Objectives for Information and related Technology) and NIST (National Institute of Standards and Technology, SP 800-53 and SP 800-92) have recommended retaining event log data for one year [83] and 5 years [84], respectively.

Given these facts and recommendations, a long-term archival of events is a desired feature of modern SIEMs. A component like this can support a variety of compliance processes for forensic investigations and on-site auditing requests. However, an important question is how much such an archive would cost, since the storage of these events can reach many terabytes of data. Fortunately, public cloud service providers (e.g., Amazon S3, Azure Blob Store, Rackspace Files and Google Storage) offer secure cloud storage services for long durations with quite feasible costs. For instance, the cost of storing a gigabyte per month is less than USD 0.03 [85].

In order to address the local storage constraints of long-term archival, we propose to develop a cloud-based system for storing selected subsets of events for a long term, by using cloud storage services. We call this system SLiCER, which stands for Safe Long-term Cloud Event aRchival. The key idea is to define methods to organize the events by sensors, date, and time range, and pack, seal and store them in multiple clouds. This organization will allow the search of events in accordance with their date of occurrence, type and importance. In addition, it will allow retrieving events
from the multiple clouds as efficiently as possible. We propose to replicate the events in multiple clouds to satisfy the security requirements of such data (which contains sensitive and personal information), thus storing such events in diverse cloud providers and employing techniques such as encryption and erasure codes [86]. For accessing the multiple clouds in a secure and efficient way, we employ the DepSky multi-cloud data storage library [87], adjusting and adapting it as needed.

Another important requirement of our component is cost effectiveness. As mentioned before, cloud storage space is cheap, but reading data and performing requests can be very expensive. We aim for an attractive system that enables SOCs to have a long-term event archival system, either by replacing their (limited) local event archive with SLiCER or by extending its retention period by using SLiCER to store important events that would otherwise be deleted.

4.2 State of the Art

SIEM systems can differ in many ways, for instance, in how the events are collected, normalized, transported, indexed, and stored. Considering the database technology used by SIEMs to store the collected events, current SIEM systems can be classified into two categories: those using traditional relational databases, such as SQL Server, Oracle and MySQL (e.g., ArcSight [88], OSSIM [89]), and those using NoSQL approaches (e.g., XL-SIEM, Splunk [90], Elastic Stack [91]). In the former, there is a separate storage system, the logger, which archives events coming from connectors, while the latter maintains the archive in the main SIEM database. Figure 8 (a) represents an architecture based on sensors that generate events and connectors that receive these events to normalize, filter, and aggregate them, before sending the events to the SIEM for correlation and reporting. In this architecture, the connectors forward events to be stored both on the SIEM database and on the logger. Generally, events are stored for a short time in the SIEM database and for a longer time in the logger (e.g., EDP stores events for 3 months in the SIEM and 6 months in the logger). After these periods, the oldest events are erased from the repositories. Even SIEM databases, which are relational in this architecture, store the collected events taking into consideration their creation time. More specifically, they typically employ a multi-level database schema, in which there are different databases by category and/or by time range, i.e., one database for each day [92]. In this model, the recent events are kept on the first level to improve the performance of search queries accessing recent data. Later, these events are moved (e.g., every day or after a week) to lower levels, with expected lower access frequency, until reaching levels where rare accesses are expected, and in some cases the events can even become unsearchable. The second architecture, in Figure 8 (b), represents SIEMs following the NoSQL approach. In this model there is typically no logger or connectors, meaning that the sensors send the events directly to the SIEMs, which can do the whole processing. These SIEMs are more flexible, and able to scale their storage and processing needs
to more nodes. Furthermore, they can also use a multi-level approach for organising data in time ranges. For example, the SIEM can be configured to store recent events for a short period of time (e.g., 3 months) in the fast-accessed Hot Events storage and, after this time, they are moved to another level/type of database (e.g., the Cold Events repository, and later to the Frozen Events repository), as illustrated in the figure. Splunk, for instance, has four levels of storage [93]. Notice that in all these levels, the events are still available for searching, but are expected to be less accessed than those that are in the first repository, and thus fewer resources are allocated for serving these requests.

Figure 8: SIEM archival architectures.

A local long-term event archive demands tens or thousands of terabytes of local storage space, making it very expensive to maintain (e.g., staff, hardware, energy, etc.). On the other hand, if these data are discarded, important events would be lost. A solution to maintain these data easily and avoid data loss is to use cheap cloud storage services such as Amazon S3, Windows Azure Blob Store, Rackspace Files, and Google Storage. Although some SIEMs can back up or create snapshots of their database to the cloud [94] [95], none of the existing systems provides a complete interactive cloud-backed archive that primarily uses resources from the cloud. There are a few solutions that propose to keep the whole SIEM in the cloud. However, due to concerns related to the confidentiality of the events and the lack of trust in third-party services, many organizations are reluctant to fully employ such solutions. SLiCER can improve the archival capacities of SIEMs following both architectures discussed in this section, i.e., it can replace or extend the logger or the low-frequency storage levels of databases. Replacement refers to cases where SLiCER takes the place of these repositories, whereas extension involves the creation of an additional level (the cloud) to move the data to after their current retention period.

4.3 Secure Data Storage Using Multiple Clouds

SLiCER proposes to extend the storage time of archived events by putting them on cheap cloud storage services such as Amazon S3, Windows Azure Blob Storage, and Google Storage (among others). However, given that the events collected by SIEMs
contain sensitive information, many security concerns arise when the storage and management of such data is offloaded to third-party infrastructures. To overcome these concerns, SLiCER will store the events in a cloud-of-clouds, where data is encrypted and redundantly stored in several different cloud storage providers. This solution improves the availability, integrity, and confidentiality of the stored data, addressing four important limitations of the cloud [87]: loss of availability, because if some provider is down the data will still be available from the other providers, as the data is split and stored across several providers; loss and corruption of data, because corrupted data can be recovered, thanks to the data being spread and replicated with erasure codes; loss of privacy, thanks to the encryption; and vendor lock-in, because if we want to change provider, we do not need to migrate all the data to the new provider. SLiCER relies on DepSky and its evolutions [96] for implementing the cloud-of-clouds archival. Figure 9 shows the DepSky data processing flow adapted to SLiCER. In a nutshell, the data is encrypted with a symmetric key, an error correction code (i.e., a systematic erasure code [86]) is applied to generate redundant blocks, and finally the blocks are written to multiple cloud providers using Byzantine quorum protocols.

Figure 9: SLiCER dependable cloud storage, adapted from DepSky.

In the following we present a brief overview of the three techniques employed to ensure the security, dependability and cost-effectiveness of the event storage in the cloud.

Encryption: before sending data to the cloud services, this data should be encrypted in order to ensure its confidentiality. In the original DepSky, the encryption key is distributed through the clouds, together with the data blocks. However, in SLiCER we do not need that, as the archival engine will be managed by a single entity that both writes and reads the data. Therefore, SLiCER relies on a user-provided symmetric cryptographic key to encrypt/decrypt the data.

Erasure codes: systematic erasure coding [86] is a type of error correction code that splits the original data into m blocks and generates k additional blocks that can be used to recover the original data in case some of the blocks are lost. For instance, in the figure, the original data is split into two blocks, and two additional blocks are generated (m = 2 and k = 2); in this way each of the four clouds being used receives one block with half of the size of the original data. In order to recover the data, any combination of two of these blocks can be used: by simply concatenating the two first blocks, or by using the erasure code to combine one of the original blocks with one of the generated blocks, or the two generated blocks. In the end, the scheme ensures that any m blocks are enough to recover the original data, thus tolerating the loss of up to k of the blocks.

Byzantine quorums: the Byzantine quorum distribution [97] ensures that the data is written to a minimum number of providers in such a way that even if a number of clouds fail (by being offline or corrupting the data), there are enough replicas (other cloud storage providers) to form a quorum sufficiently large to recover the written data. To ensure these features, a set of n ≥ 3f + 1 storage clouds is required, of which at most f may be faulty; this correlates with the erasure code parameters in the following way: m = f + 1 and k = 2f (hence n = m + k). Availability is ensured because the data is stored in a quorum of at least n − f clouds, and any two such quorums intersect in at least m clouds [87].

As a final remark, it is worth noting that although Figure 9 shows the data being written to the four clouds, this is not a strict requirement. Indeed, the system requires this number of clouds to be able to tolerate a single failure; however, writing the data to only three clouds (a quorum) is enough to ensure the data is both safe and live. In this way, we have enough redundancy for recovering the data in case of failures, with a storage overhead of only 50%. The fourth cloud will be used only if one of the three preferred clouds (the cheaper or faster ones) is unresponsive.
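
To make the flow of Figure 9 concrete, the sketch below walks through the encrypt/split/redundancy steps for a single block of events. It is a simplified illustration only: a single XOR parity block stands in for the k = 2f redundant blocks of a real systematic erasure code, and no actual cloud uploads are performed.

```python
from cryptography.fernet import Fernet

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# 1) Encrypt the block of events with a symmetric key (user-provided in SLiCER).
key = Fernet.generate_key()
ciphertext = Fernet(key).encrypt(b"compressed block of archived events")

# 2) Split the ciphertext into m = 2 data blocks and generate one XOR parity
#    block: any 2 of the 3 blocks are enough to rebuild the ciphertext. A real
#    deployment would use a systematic erasure code with k = 2f extra blocks.
if len(ciphertext) % 2:
    ciphertext += b"\x00"  # pad to an even length
half = len(ciphertext) // 2
block_1, block_2 = ciphertext[:half], ciphertext[half:]
parity = xor(block_1, block_2)

# 3) Each block would be written to a different cloud provider; DepSky's
#    Byzantine quorum protocols handle the actual uploads.
clouds = {"cloud_a": block_1, "cloud_b": block_2, "cloud_c": parity}

# Recovery example: block_2 was lost, so rebuild it from block_1 and the parity.
recovered = clouds["cloud_a"] + xor(clouds["cloud_a"], clouds["cloud_c"])
assert recovered == ciphertext
```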

4.4 SLiCER

4.4.1 SLiCER Overview

SLiCER aims to overcome the storage constraints of current SIEMs, providing a way to retain their events for long periods by leveraging the low cost of cloud storage services. The system can store events from many sources, such as a monitored device, another archival system, or even directly from a SIEM. The core of the system is responsible for organizing the received events for efficient storage and retrieval in the clouds, both in terms of performance and monetary costs. Secure storage is implemented by using cloud-of-clouds storage techniques (as explained in the previous section), while data retrieval is done using indexes customised for each type of event. To implement such features, SLiCER needs to address the following challenges:

• Find the most efficient way of retrieving events from the SIEM infrastructure (devices, connectors, loggers or directly from the SIEM).

• Organize the collected events to later facilitate the queries to be performed on them.

• Store/read events to/from the clouds in a secure and cost-efficient way.

• Provide an interface for the SOC analyst to interact with the system.

In the following subsections, we describe the SLiCER architecture, as well as its data and query models, in detail. We conclude this section with a description of the events retrieval algorithm, and a discussion of how SLiCER can be integrated with existing SIEMs.

4.4.2 SLiCER Architecture

SLiCER is composed of three parts: Event Archiver, Query Manager, and Cache. Figure 10 shows the general architecture of the system, in which we can observe its three parts and its interaction with the SIEM and the multiple clouds. The figure contains the following components.

Figure 10: The SLiCER system architecture.

Event Archiver, which encompasses the central storage functions of SLiCER, in the following way:

1. Receives and organizes events by sensor/device: SLiCER monitors the arrival of new events coming from the SIEM infrastructure (devices, SIEM, logger). Events are organized by sensor/device, putting the resulting event groups in a separate cache (as will be detailed later).

2. Creates event blocks: for each group of events, SLiCER splits it into blocks of events using a time range. SLiCER first calculates the time range, taking into account the volume and size of the group of events, and then proceeds with the splitting.

3. Creates index of blocks of events: the index is created for each block of events by a specific method, according to the type of stored events and the expected queries in the archive.

4. Sends index to Cache: the index generated in the previous step is sent to the Cache in order to ensure that the queries can be run locally and faster, and without costs of reading data from the cloud. The block of events can also be sent to the Cache if we want to keep the recent events locally.

5. Sends block and index to the cloud: the compressed block and its respective index are sent to the cloud using the techniques discussed in Section 4.3.

Query Manager, which is the interface between the SOC analyst and the cloud, and performs the following actions:

1. Receives queries and gets the indexes of event blocks: the query manager receives and checks the query’s syntax, and fetches from the Cache the indexes of blocks of events from the devices specified in the query. The indexes can also be obtained from the cloud, in case of a cache miss/fail.

2. Executes queries and retrieves event blocks: the queries are executed over the indexes of each sensor/device, using the search parameters specified in the query together with the predefined index method, and yielding the names of the event blocks that are to be read from the cloud. Then, these blocks are downloaded to the Cache.

3. Extracts events: the blocks are decompressed and the events matching the query are extracted and cached.

4. Delivers events to the SOC analyst: the resulting set of events is delivered to the SOC analyst.

Cache, whose function is to store the indexes and the results of the queries executed locally by the query manager, namely the list of names of blocks that will be read from the cloud, the blocks downloaded from the cloud, the list of events that will be extracted from blocks, and the events themselves. The use of the Cache allows the execution of queries without accessing (or with fewer accesses to) the cloud, minimizing in this way the costs of reading from the cloud. The Cache can also be used to store the more recent compressed events, working like a logger.

4.4.3 Data Model

The easiest way to design a long-term event archive would be to store all events in the cloud without previous treatment, i.e., send events directly to the cloud, where each of them is a separate file. On the one hand, this solution would allow SOC analysts to retrieve any events they wish with fine granularity. On the other hand, it might generate thousands of files (events) per minute, which translates to thousands of requests for the storage and retrieval of this data. The associated costs might render this straw man solution infeasible.

The SLiCER design tries to strike a balance between the cost and the performance of cloud storage usage, aiming for 1) low data storage costs; 2) low costs for storing and retrieving data from the cloud; and 3) acceptable query performance over the cloud-backed archive.


We propose a data model based on blocks of events to store and read data in the cloud. The related cloud storage costs are better balanced using a block structure, which generates fewer files to store and retrieve and, consequently, fewer requests to the cloud. Search performance is achieved by executing the queries over the indexes of the blocks. The data model therefore works in two steps: blocks and indexes.

1. Blocks step: a SIEM infrastructure contains several types of sensors/devices, each of which may generate many types of events, each with a different size. The data model organizes the block structure by sensor/device and date-time. For each sensor, it creates blocks by aggregating the events of a certain type generated during a time range (e.g., 10 min, 1 hour, 1 day). The definition of the time range depends on the size and frequency of the events. Therefore, SLiCER should be configured to create blocks of an adequate size, neither too small nor too large.

2. Indexes step: for each block of events, the model creates an index using the chosen method. Basically, the index contains information related to the events contained in the block.

This block structure allows the data model to support queries targeted at specific sensors/devices, accessing their events by time range. These queries use the indexes to determine which blocks should be retrieved from the cloud.

Figure 11: SLiCER data model.

Figure 11 shows a representation of the data model. The first and second columns represent the devices and the date-time ranges, respectively. The third column represents the block file and its index file. These first three columns form the structure sent to the cloud; the fourth column is only a visual representation of the events contained in the block of the third column (Events.zip). For example, Device 5 and the date-time range 2017-06-01 06.00 are selected (highlighted in grey in the figure).
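As an illustration only, the structure of Figure 11 could be mapped to object names of the form device/date-time/Events.zip; the exact naming scheme and the one-hour range used below are assumptions, not the SLiCER implementation.

from datetime import datetime, timedelta

def block_key(device, start):
    # Hypothetical object name for a block; the real SLiCER layout may differ.
    return f"{device}/{start:%Y-%m-%d %H.%M}/Events.zip"

def index_key(device, start):
    return f"{device}/{start:%Y-%m-%d %H.%M}/Index"

def block_starts(start, end, time_range=timedelta(hours=1)):
    """Enumerate the date-time ranges (and hence the blocks) covering a query period."""
    t = start
    while t <= end:
        yield t
        t += time_range

# Example: blocks of Device 5 needed for a three-hour query window.
for t in block_starts(datetime(2017, 6, 1, 6, 0), datetime(2017, 6, 1, 9, 0)):
    print(block_key("Device 5", t))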

4.4.4 Query Model

The query manager is the part of SLiCER that receives, interprets and executes a query. For a correct interpretation and execution of the query (with its parameters), the query must be represented according to the query model, which follows the block structure defined in the data model. The query is then executed using the algorithm described in the next section.

A query is composed of three components (devices, time period and terms), as represented in the following tuple:

query = (d; startTime, endTime; t)

In the tuple, d is a set of devices (or sensors) that generate the events we are interested in, which is a subset of all devices (d ⊆ {D1, D2, D3, ..., Dn}); the period of time (startTime and endTime) is the time range in which to search for events; and the terms t refer to the set of words we are looking for in an event (t ⊆ S, with S being the set of all strings in the events domain).
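Translated directly into code, the tuple could be represented as follows; the field names are illustrative assumptions.

from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class Query:
    devices: List[str]     # d: subset of all devices (empty list = all devices)
    start_time: datetime   # startTime
    end_time: datetime     # endTime
    terms: List[str]       # t: words to look for in the events

# Example query for a single device and a one-hour window.
q = Query(devices=["Device 5"],
          start_time=datetime(2017, 6, 1, 6, 0),
          end_time=datetime(2017, 6, 1, 7, 0),
          terms=["user1"])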

Some considerations about the components of the query are the following:

Devices: the devices specified in the query and their corresponding events are expected to be in the SLiCER archive. However, some unexpected cases may occur. For instance, if the device array is empty, SLiCER considers all existing devices for the query; if the device array contains an unknown device, the system shows an error message and executes the query for all other specified devices.

Period of time: the startTime and endTime define a time interval for the events. If one or both parameters are null, an error message is shown and the query processing is aborted; likewise, if startTime comes after endTime, an error message is shown and the query processing is aborted. Finally, if startTime is before the beginning or endTime is after the end of the range of events collected in the archive, the search covers the entire archive. A small validation sketch illustrating these rules is given below.
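The sketch below encodes these validation rules using the Query structure introduced above; the archive bounds passed as arguments are assumed to be known to the query manager.

def validate(query, known_devices, archive_start, archive_end):
    # Empty device array: consider all existing devices.
    if not query.devices:
        query.devices = list(known_devices)
    else:
        # Unknown devices: report them and run the query for the remaining ones.
        unknown = [d for d in query.devices if d not in known_devices]
        if unknown:
            print(f"unknown devices ignored: {unknown}")
            query.devices = [d for d in query.devices if d in known_devices]
    # Missing or swapped time bounds: abort the query.
    if query.start_time is None or query.end_time is None:
        raise ValueError("startTime and endTime must be defined")
    if query.start_time > query.end_time:
        raise ValueError("startTime must precede endTime")
    # Bounds outside the archived range: search the entire archive.
    if query.start_time < archive_start or query.end_time > archive_end:
        query.start_time, query.end_time = archive_start, archive_end
    return query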

4.4.5 Events Search Algorithm

Algorithm 1 describes the procedure to execute a query for searching events, returning to the SOC analyst the events that match. Basically, for each device in the query, the algorithm gets the names of the blocks (lines 5 and 6) in order to obtain the index files of the blocks within the specified time period (lines 7 to 9). Next, each resulting index file is checked to verify whether the terms specified in the query are present in the block (line 10). If so, the corresponding block is read from the cloud (line 11), decompressed, and the events that satisfy the terms are extracted. The resulting events are concatenated with the previous results (line 12). The search process is then repeated for the next device. The procedure is sketched below.
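The sketch below reconstructs this procedure from the description (the actual Algorithm 1 is given as a listing in the deliverable); the archive object and its methods are placeholders for the Cache and cloud operations, not real SLiCER APIs.

import gzip

def search(query, archive):
    results = []
    for device in (query.devices or archive.all_devices()):       # each device in the query
        for name in archive.block_names(device, query.start_time,
                                        query.end_time):           # lines 5-6: block names
            index = archive.get_index(device, name)                # lines 7-9: index files
            if matches(index, query.terms):                        # line 10: terms in block?
                block = archive.download_block(device, name)       # line 11: read from cloud
                events = gzip.decompress(block).decode().splitlines()
                results += [e for e in events
                            if all(t in e for t in query.terms)]   # line 12: concatenate
    return results                                                 # delivered to the SOC analyst

def matches(index, terms):
    # With "No index" (Section 4.5.1) this always returns True; with Bloom filters
    # or Lucene it checks the query terms against the block's index.
    return all(t in index for t in terms) if index else True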

4.5 Indexing

Indexing is usually done by creating an auxiliary data structure, the index, which is used to locate data efficiently without having to search the whole archive. In our case, the data comprises text events coming from different sensors in different formats, i.e., with different fields. This section presents two methods for indexing these contents, namely Bloom filters and Lucene. The former is based on the events' format, i.e., the fields that compose them, while the latter is associated with the events' contents. Before presenting these index methods, we describe how the system works if no index is used.

4.5.1 No Index

The data model proposed for SLiCER makes a primary division of events by device and date-time, thus creating blocks of events. In some cases, this division is sufficient to perform simple queries, obtaining the right blocks without the need to access any type of index. For example, a query for all events of a device within a specified period of time does not need any index to determine which blocks have to be retrieved from the cloud. In this mode, the match function (line 10 of Algorithm 1) always returns true.

4.5.2 Bloom Filters


Bloom filters [98] are probabilistic data structures used to represent sets, supporting the operation of testing whether an element belongs to a predefined set of elements. Being probabilistic, a Bloom filter answers that an element may be in the set or that it definitely is not in the set, meaning that it can produce false positives but never false negatives. A Bloom filter is built from a set of m elements, an array of n bits initially set to 0, and k hash functions. For each element, the k hash functions are applied to it, resulting in k hash values between 0 and n-1, i.e., indexes of the bit array. The bits at these indexes are then set to 1. To check whether an element e belongs to the set, the process of calculating the indexes for e is repeated with every one of the k hash functions, and the resulting indexes are checked in the bit array. If any of these bits is not set, the element does not belong to the set; otherwise, the element may be in the set (another element or some combination of other elements could have set the same bits). In the latter case, the result of the Bloom filter can be a false positive. Figure 12 shows an example of a Bloom filter, using a 24-bit vector, 2 hash functions and 3 inserted elements (user1, user2 and user3). For example, the element user1 sets the bits at indexes 1 and 11 to 1 (the grey squares in the array). The bottom part of the figure represents the verification of 3 elements (user1, user4 and user5) against the Bloom filter. The filter outputs that both user1 and user5 may be in the set because their hashes match bits set to 1, whereas user4 definitely is not in the set because one of its hashes points to a bit that is not set. While the result for user1 is correct, the result for user5 is a false positive.

Figure 12: Bloom Filter execution scenario.
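A minimal Bloom filter along these lines can be written as follows; the 24-bit array, the two salted SHA-256 hash functions and the example elements mirror Figure 12 but are otherwise illustrative assumptions.

import hashlib

class BloomFilter:
    def __init__(self, n_bits=24, k=2):
        self.n = n_bits          # size of the bit array
        self.k = k               # number of hash functions
        self.bits = [0] * n_bits

    def _indexes(self, element):
        # Derive k indexes by salting a cryptographic hash with the function number.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{element}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, element):
        for idx in self._indexes(element):
            self.bits[idx] = 1

    def might_contain(self, element):
        # False means "definitely not in the set"; True means "may be in the set".
        return all(self.bits[idx] for idx in self._indexes(element))

bf = BloomFilter()
for user in ("user1", "user2", "user3"):
    bf.add(user)
print(bf.might_contain("user1"))  # True (element was inserted)
print(bf.might_contain("user4"))  # False if no collision; True would be a false positive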

A Bloom filter is able to represent any element, independently of its size and value, using k indexes. Therefore, the size of a Bloom filter depends only on the dimension defined for the bit array (n bits) and the target probability of false positives. For this reason, Bloom filters are a very compact representation of sets. Furthermore, they are a fast and cost-efficient structure for checking whether an element is present in a given set. Their only disadvantage is the generation of false positives, which makes them imprecise. The false positive rate can be adjusted to a low percentage (e.g., 1%, 0.5%, 0.1%), but this implies a bigger bit array and thus a larger index file. Therefore, a balance between these variables is needed to obtain small files and a low false positive rate.

The Bloom filter is an interesting option as an index method for SLiCER. In a first approach, we propose to create a Bloom filter for each event field defined by the SOC analyst during the SLiCER configuration. Then, when SLiCER creates a block of events, the Bloom filters are generated and stored with the block. For example, if the username field is selected by the SOC, a Bloom filter containing all the distinct usernames appearing in the events of a block is created. The same can be done for IP addresses or ports, either using a single Bloom filter or an index composed of several filters, one for each type of indexed field. Given the high cost of downloading data from the cloud, the SLiCER design needs to pay special attention to the possibility of false positives. More precisely, we want to avoid the following situation: given a search term, the Bloom filter gives a positive answer for some block; SLiCER then downloads that block from the cloud, but the term does not exist there. This means that the Bloom filter gave a false positive and an unnecessary read was made from the cloud. This is the cost of having a small index, but since the precision can be adjusted, the Bloom filter can offer a good cost-benefit trade-off.
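The trade-off between the false positive rate and the index size follows the standard Bloom filter sizing formulas, n = -m ln(p) / (ln 2)^2 and k = (n/m) ln 2; the per-field example below (100,000 distinct usernames in a block) is purely illustrative and not based on project data.

import math

def bloom_size(m, p):
    """Bit-array size n and number of hash functions k for m elements and target rate p."""
    n = math.ceil(-m * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((n / m) * math.log(2)))
    return n, k

# e.g., a per-block username filter with 100,000 distinct values at a 0.1% false
# positive rate needs about 1.44 million bits (roughly 180 KB) and 10 hash functions.
print(bloom_size(100_000, 0.001))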

4.5.3 Lucene Index

Lucene [99] is an open-source Java-based indexing and search engine, maintained and distributed by the Apache Foundation. It is a library and API widely used to add text search capabilities to products and websites. Some important features of Lucene are its data ingestion and processing capacity (over 150GB/hour on modern hardware), its low RAM requirements (only 1MB of heap), and the size of the resulting index (roughly 20-30% of the size of the indexed text) [99]. Lucene provides indexing and searching over structured and unstructured documents, which can contain any type of content, such as text (words, sentences), fields from formatted documents (e.g., fields of event logs), and numeric values. In addition, Lucene achieves fast search results because, instead of looking directly at documents, it searches the indexes previously created from those documents. To index documents, Lucene collects all terms included in the documents, excluding those that belong to the stopword list (e.g., "the" and "on"), and creates an index structure for them using an inverted index schema. This schema is used by Lucene during search tasks, allowing the reverse mapping from terms to documents, i.e., returning which documents contain a given term. Figure 13 shows an indexing example over three documents, using a stopword list and outputting the inverted index. For instance, using the inverted index to search for the terms "butterfly" and "sky", Lucene returns that the former is present in Document 1, while the latter appears in Document 2 and Document 3. Thanks to this inverted index, Lucene is both accurate and precise in its searches, producing neither false positives nor false negatives.

This method of indexing and searching is another option for SLiCER to index the event contents of each block it creates. We envisage two ways of using Lucene. The first is to configure Lucene to produce indexes based on the values of the fields defined by the SOC analyst, which requires first obtaining all distinct values for each field and passing them as indexing terms to Lucene. The second is to create indexes with Lucene's default configuration, i.e., using all event values as terms, while defining a stopword list. By using both forms it is possible to perform more specific and complex queries, especially if the second form is chosen. However, even though the indexing capabilities of this method provide accurate and precise results, they have an associated cost: the indexes produced by Lucene are between 20-30% of the size of the indexed data, with the default configuration and without compression. This factor increases the costs of storing data in and reading data from the cloud if the second form is selected.

Figure 13: Indexing example performed by Lucene [102].

Therefore, we are going to study the best configuration for the second form in order to reduce costs. We will also study a combination of both forms, i.e., SLiCER uses the first form to create indexes and send them to the cloud, and uses the second form to create indexes locally for the blocks that it fetches from the cloud as a result of queries executed over the first indexes. This means that a query is performed in two steps: 1) some terms of the query are used on the first indexes to determine which blocks to fetch from the cloud; 2) all terms of the query are used on the second indexes to extract events from those blocks. In addition, as mentioned in the previous section, a further study will combine Bloom filters with Lucene.
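To make the inverted-index principle concrete, the short sketch below reproduces the behaviour described for Figure 13; it only illustrates the idea and is not the Lucene API, and the document contents are invented to match the example terms.

from collections import defaultdict

STOPWORDS = {"the", "on", "a", "is", "in"}  # illustrative stopword list

def build_inverted_index(documents):
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            if term not in STOPWORDS:
                index[term].add(doc_id)   # term -> documents containing it
    return index

docs = {
    "Document 1": "the butterfly is on the flower",
    "Document 2": "a bird in the sky",
    "Document 3": "the sky is blue",
}
index = build_inverted_index(docs)
print(sorted(index["butterfly"]))  # ['Document 1']
print(sorted(index["sky"]))        # ['Document 2', 'Document 3']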

4.6 Preliminary Cost Analysis

This section presents an estimation of the costs of storing and reading raw data (without encryption, compression, etc.) to/from the cloud, based on the average number of events generated by the SIEMs of three partners of the DiSIEM project (Atos, EDP and Amadeus) and on the prices charged by Amazon S3 [85] (used as a reference). Table 1 shows the prices charged by Amazon S3 for storage, operations and retrieval, as of July 2017. The table is composed of four sub-tables. The first presents the monthly storage prices for three categories: Standard Storage, for frequent access; Standard Infrequent Access; and Glacier, for rare access. The second sub-table shows the prices of operations for Standard Storage and Infrequent Access requests. For these two categories, all requests made to the cloud are charged, except for delete operations. These operations are used, for instance, to request the storage of an event block (PUT) in the cloud; COPY and POST operations are also charged in this category (first line). The second line shows, for example, the price of a GET request. The last line shows the price to retrieve data, per gigabyte; this price applies to retrievals made to a VM running inside the Amazon infrastructure.

Table 1: Reference prices charged by Amazon S3 to store and read data [85]

The third sub-table shows the prices for Glacier. This service assumes that the data is rarely accessed, so it is stored in a different storage tier and the retrieval prices vary according to the required data retrieval speed (Expedited, Standard and Bulk). The last sub-table shows the price to transfer data out of Amazon S3, which is the expected case for a typical SLiCER deployment. For instance, if we request a block of 500 megabytes from the cloud, we will be charged for the data transferred out of the cloud (about $0.045) and for the request made (GET).

In Table 2 we present the average amount of data generated by each partner. The second and third columns show the number of events per second and per month, respectively. The difference in the number of events generated by the three partners is considerable, which will be reflected in the storage costs. The next three columns represent these events as amounts of data (GB) per hour, day and month. The last column shows the amount of data (TB) stored after collecting events for 5 years; after this period the archive size stabilizes and data rotation starts, discarding the oldest data while ingesting the newest. We do not consider possible (monthly or yearly) growth of the data volume, so this study only gives an indication of the amount of storage required. We considered 800 bytes as the average size of an event, taking as reference the average size of the events exported by EDP's ArcSight.

Table 2: Amount of data generated by the partners of the DiSIEM project

Table 3 gives an estimate of the costs after five years of storage, using the reference prices of Table 1 and the amount of data each partner requires (Table 2). The second column shows the monthly storage cost, considering 6 months of storage in the Standard Infrequent Access category, which is probably the category with the most accesses, since it keeps the most recent events of SLiCER. The third column shows an estimate of the amount of data retrieved per month and its cost; this value was calculated as 10% of the amount of data stored in a month. It is variable and depends on how much data each partner will retrieve in a month. The next two columns represent the estimated volume of requests per month and their cost; these values are also variable and depend on how much data each partner will retrieve in a month. The last two columns show the total monthly and yearly costs for storing and requesting data.

Table 3: Estimated costs for storing data in the cloud after 5 years

The presented values do not consider other factors that impact the costs. On the one hand, data compression can significantly decrease the amount of data to store, so the costs may decrease. On the other hand, the erasure codes used by the multi-cloud storage algorithm, as well as the indexes, increase the amount of data to store, so the costs may increase. A detailed study of these factors will be performed in the coming months of DiSIEM.
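As a back-of-the-envelope check of this methodology, the snippet below converts an event rate into a monthly volume and a rough cost, using the 800-byte average event size and the 10% retrieval assumption from the text; the unit prices are placeholders to be replaced by the current Amazon S3 price list [85], and the split between Standard Infrequent Access, Glacier and request costs is deliberately ignored.

EVENT_SIZE_BYTES = 800            # average event size (reference: EDP's ArcSight)
PRICE_STORAGE_GB_MONTH = 0.0125   # placeholder storage price per GB-month
PRICE_TRANSFER_OUT_GB = 0.09      # placeholder data-transfer-out price per GB

def monthly_costs(events_per_second, retrieved_fraction=0.10):
    gb_per_month = events_per_second * EVENT_SIZE_BYTES * 86400 * 30 / 1024**3
    storage = gb_per_month * PRICE_STORAGE_GB_MONTH
    retrieval = gb_per_month * retrieved_fraction * PRICE_TRANSFER_OUT_GB
    return gb_per_month, storage + retrieval

# Hypothetical partner generating 1,000 events per second:
volume, cost = monthly_costs(1000)
print(f"{volume:.0f} GB/month, about ${cost:.2f}/month (storage + 10% retrieval)")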


5 Summary and Conclusions

In this report, we presented the results of the work carried out as part of the Infrastructure Enhancement Work Package. We presented the architecture and the service model of four extensions that address known SIEM limitations: two Enhanced Application Monitoring extensions that model user behavior at the application and network levels to detect anomalous activities, a Diverse Monitoring extension to assess the reliability of combining monitoring systems and to predict future alerts, and an extension for secure, cost-effective, long-term SIEM event archival using public cloud storage services. This document will be used as a reference for the development of the extensions, their testing, and their integration with the SIEMs.


6 References

[1] Gartner, "Magic Quadrant for Security Information and Event Management," 2016.

[2] V. Chandola, A. Banerjee and V. Kumar, "Anomaly Detection: A Survey," 2009.

[3] S. Omar, A. Ngadi and H. H. Jebur, "Machine Learning Techniques for Anomaly Detection," International Journal of Computer Applications, vol. 79, no. 2, pp. 35-41, 2013.

[4] A. Ghorbani, M. Tavallaee and W. Lu, Network Intrusion Detection and Prevention: Concepts and Techniques, 2009.

[5] M. Goldstein and S. Uchida, "A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data," 2016.

[6] F. J. Anscombe and I. Guttman, "Rejection of Outliers," Technometrics, vol. 2, no. 2, pp. 123-147, 1960.

[7] E. Parzen, "On Estimation of a Probability Density Function and Mode," Annals of Mathematical Statistics, vol. 33, no. 3, pp. 1065-1076, 1962.

[8] H. Motulsky, Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking, 1995.

[9] M. H. Bhuyan, D. K. Bhattacharyya and J. K. Kalita, "Network Anomaly Detection: Methods, Systems and Tools," IEEE Communications Surveys & Tutorials, vol. 16, no. 1, pp. 303-336, 2014.

[10] A. Juvonen and T. Sipola, "Anomaly Detection Framework Using Rule Extraction," 2014.

[11] P. Naldurg, K. Sen and P. Thati, "A Temporal Logic Based Framework for Intrusion Detection," in International Conference on Formal Techniques for Networked and Distributed Systems, 2004.

[12] J. M. Estevez-Tapiador, P. García-Teodoro and J. E. Díaz-Verdejo, "Stochastic protocol modeling for anomaly based network intrusion detection," in Information Assurance, 2003.

[13] Shao-Shin Hung and D.-M. Liu, "A user-oriented ontology-based approach for network intrusion detection," Computer Standards & Interfaces, vol. 30, no. 1-2, pp. 78-88, 2008.

[14] D. Heckerman, "A Tutorial on Learning With Bayesian Networks," 1996.

[15] K. Johansen and S. Lee, "CS424 Network Security: Bayesian Network Intrusion Detection," 2003.

[16] D. Zuev and A. W. Moore, "Internet Traffic Classification Using Bayesian Analysis Techniques," Sigmetrics, vol. 33, no. 1, pp. 50-60, 2005.

[17] J. R. Quinlan, "Induction of decision trees," 1986.

[18] V. N. Vapnik, "Statistical learning theory," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 998-999, 1999.

[19] H. Li, "Research and Implementation of an Anomaly Detection Model Based on Clustering Analysis," in Intelligence Information Processing and Trusted Computing (IPTC), 2010.

[20] C. Zhang, G. Zhang and S. Sun, "A Mixed Unsupervised Clustering-based Intrusion Detection Model," in Genetic and Evolutionary Computing, 2009.

[21] J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," Journal of Cybernetics, vol. 3, no. 3, pp. 32-57, 1973.

[22] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, 2013.

[23] S. Mabu, C. Chen, N. Lu, K. Shimada and K. Hirasawa, "An Intrusion Detection Model Based on Fuzzy Class Association Rule Mining Using Genetic Network," IEEE Transactions on Systems, Man, and Cybernetics, vol. 41, no. 1, 2011.

[24] Y.-P. Zhou and J.-A. Fang, "Intrusion Detection Model Based on Hierarchical Fuzzy Inference System," in Information and Computing Science, 2009.

[25] T. Kohonen, "Self-Organizing Map," Neurocomputing, vol. 21, no. 1-3, pp. 1-6, 1998.

[26] G. A. Carpenter and S. Grossberg, Adaptive Resonance Theory, 2003.

[27] O. Nasraoui, E. Leon and R. Krishnapuram, "Unsupervised Niche Clustering: Discovering an Unknown Number of Clusters in Noisy Data Sets," 2005.

[28] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443-1471, 2006.

[29] H. Tong, D. Koutra and L. Akoglu, "Graph based Anomaly Detection and Description: A Survey," Data Mining and Knowledge Discovery, vol. 29, no. 3, p. 626–688, 2015.

[30] D. Hawkins, Identification of Outliers, 1980.

[31] C. C. Aggarwal and S. Sathe, Outlier Ensembles: An Introduction, 2017.

[32] A. Zimek, R. J. G. B. Campello and J. Sander, "Ensembles for Unsupervised Outlier Detection," ACM SIGKDD Explorations Newsletter, vol. 15, no. 1, pp. 11-22, 2013.

[33] E. Schubert, R. Wojdanowski, A. Zimek and H.-P. Kriegel, "On Evaluation of Outlier Rankings and Outlier Scores," in SIAM International Conference on Data Mining, 2012.

[34] P. M. Mafra, V. Moll, J. d. S. Fraga and A. O. Santin, "Octopus-IIDS: An Anomaly Based Intelligent Intrusion Detection System," in Computers and Communications (ISCC), 2010.

[35] S. Chebrolu, A. Abraham and J. P. Thomas, "Feature deduction and ensemble design of intrusion detection systems," Computers & Security, vol. 24, no. 4, pp. 295-307, 2005.

[36] G. Folino, C. Pizzuti and G. Spezzano, "An ensemble-based evolutionary framework for coping with distributed intrusion detection".

[37] H. Nguyen, N. Harbi and J. Darmont, "An efficient local region and clustering-based ensemble system for intrusion detection," in ACM. 15th International Database Engineering, Lisbon, Portugal, 2011.


[38] H. Sengar, X. Wang, H. Wang, D. Wijesekera and S. Jajodia, "Online Detection of Network Traffic Anomalies Using Behavioral Distance," in 17th International Workshop on Quality of Service, 2009.

[39] P. Barford, J. Kline, D. Plonka and A. Ron, "A signal analysis of network traffic anomalies," in ACM SIGCOMM Internet Measurement Workshop, 2002.

[40] S. Farraposo, P. Owezarski and E. Monteiro, "An Approach to Detect Traffic Anomalies," in Conference on Network Architecture and Information Systems Security, SAR-SSI, 2007.

[41] A. Lakhina, M. Crovella and C. Diot, "Mining anomalies using traffic feature distributions," in SIGCOM, 2005.

[42] T. Ahmed, B. Oreshkin and M. Coates, "Machine learning approaches to network anomaly detection," in 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques, 2007.

[43] A. Lakhina, M. Crovella and C. Diot, "Diagnosing Network-Wide Traffic Anomalies," in Conference on Applications, technologies, architectures, and protocols for computer communications, 2004.

[44] D. Brauckhoff, "Network Traffic Anomaly Detection and Evaluation," PhD Thesis, ETH Zurich, 2010.

[45] N. Hachem, "MPLS-based mitigation technique to handle cyber-attacks," PhD Thesis, Télécom SudParis and Paris VI University, 2014.

[46] A. Wagner and B. Plattner, "Entropy Based Worm and Anomaly Detection in Fast IP Networks," in International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprise, 2005.

[47] S. Ranjan, S. Shah, A. Nucci, M. Munafò, R. Cruz and S. Muthukrishnan, "DoWitcher: Effective Worm Detection and Containment in the Internet Core," in Conference on Computer Communications, 2007.

[48] A. Kind, M. P. Stoecklin and X. Dimitropoulos, "Histogram-Based Traffic Anomaly Detection," Transactions on Network and Service Management, vol. 6, pp. 110-121, 2009.

[49] D. Brauckhoff, X. Dimitropoulos, A. Wagner and K. Salamatian, "Anomaly Extraction in Backbone Networks using Association Rules," in 9th ACM SIGCOMM Internet Measurement Conference, 2009.

[50] S. Weigert, M. Hiltunen and C. Fetzer, "Community-based Analysis of Netflow for Early Detection of Security Incidents," in 25th USENIX Large Installation Systems Administration Conference, 2011.

[51] G. Münz, S. Li and G. Carle, "Traffic Anomaly Detection Using K-Means Clustering," in GI/ITG Workshop MMBnet, 2007.

[52] C. Wagner, J. François, R. State and T. Engel, "Machine Learning Approach for IP-Flow Record Anomaly Detection," in International IFIP TC 6 Conference on Networking, 2011.

[53] P. Duessel, C. Gehl, U. Flegel, S. Dietrich and M. Meier, "Detecting Zero-Day Attacks Using Context-Aware Anomaly Detection At Application-Layer," International Journal of Information Security, vol. 16, no. 5, p. 475–490, 2017.

[54] K. Rieck, "Machine Learning for Application-Layer Intrusion Detection," Berlin, Germany, 2009.

[55] D. A. Reynolds, T. F. Quatieri and R. B. Dunn, "Speaker Verification Using Adapted Gaussian Mixture Models," 2000.

[56] C. Kruegel and G. Vigna, "Anomaly Detection of Web-based Attacks," in 10th ACM Conference on Computer and Communications Security, 2003.

[57] "OWASP," [Online]. Available: https://www.owasp.org/index.php/Main_Page.

[58] "ModSecurity," [Online]. Available: https://www.modsecurity.org/.

[59] D. Dubois and H. Prade, "What Are Fuzzy Rules and How to Use Them," Fuzzy Sets and Systems, vol. 84, no. 2, pp. 169-185, 1996.

[60] O. Thonnard and J. Zouaoiu, "Skeptic: Fraud Detection in LSS," AQG Newsletter LSS FD Framework.

[61] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas and A. Passos, "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.

[62] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Journal of Neural computation, vol. 13, pp. 1443-1471, 2001.

[63] "Novelty and Outlier Detection," Scikit Learn, [Online]. Available: http://scikit-learn.org/stable/modules/outlier_detection.html.

[64] "Novelty and Outlier Detection," Scikit Learn, 2017. [Online]. Available: http://scikit-learn.org/stable/modules/outlier_detection.html.

[65] P. J. Rousseeuw and K. V. Driessen, "A Fast Algorithm for the Minimum Covariance Determinant Estimator," Technometrics, vol. 41, pp. 212-223, 1998.

[66] F. T. Liu, M. Kai and Z. H. Zhou, "Isolation forest," in 8th IEEE International Conference on Data Mining, 2008.

[67] L. A. Zadeh, Fuzzy Sets, 1965.

[68] DeZyre, "Principal Component Analysis Tutorial," [Online]. Available: https://www.dezyre.com/data-science-in-python-tutorial/principal-component-analysis-tutorial. [Accessed 2017].

[69] "What is Event Log Correlation?," Alien Vault, [Online]. Available: https://www.alienvault.com/resource-center/videos/what-is-event-log-correlation.

[70] "Don’t Trade One Security Analytics Platform Problem for Another," Security Intelligence, [Online]. Available: https://securityintelligence.com/dont-trade-one-security-analytics-platform-problem-another/.

[71] "Do your logs reveal what matters?," Loggly, [Online]. Available: https://www.loggly.com/product/.

[72] "Security IQ," Infosec Security, [Online]. Available: http://resources.infosecinstitute.com/top-6-seim-use-cases/#gref.

[73] A. Gashi and I. Andongabo, "vepRisk - A Web Based Analysis Tool for Public Security Data.," in 13th European Dependable Computing Conference, Geneva, Switzerland., 2017.


[74] "Handbook of Software Reliability Engineering," IEEE Computer Society Press and McGraw-Hill Book Company , [Online]. Available: http://www.cse.cuhk.edu.hk/~lyu/book/reliability/index.html.

[75] "Möbius Overview," Möbius, [Online]. Available: https://www.mobius.illinois.edu/.

[76] P. Popov, "Stochastic Modeling of Safety and Security of the e-Motor," in 34th International Conference on Computer Safety, Reliability and Security, SAFECOMP 2015, 2015.

[77] M. D. Ford, K. Keefe, E. LeMay, W. H. Sanders and C. Muehrcke, "Implementing the ADVISE security modeling formalism in Möbius," in 43rd IEEE International Conference on Dependable Systems and Networks, 2013.

[78] "Perform Papers," University of Illinois at Urbana-Champaign, [Online]. Available: https://www.perform.illinois.edu/papers.html#vulnchek.

[79] "Best practices for configuring your USM installation," Inc. AlienVault, [Online]. Available: https://www.alienvault.com/forums/discussion/6705/q-a-from-webcast-best-practices-for-configuring-your-usm-installation.

[80] "ArcSight data retention," Ben Walther, [Online]. Available: https://wikis.uit.tufts.edu/confluence/display/exchange2010/ArcSight+Data+Retention.

[81] L. Bilge and T. Dumitras, "Before we knew it: an empirical study of zero-day attacks in the real world," in ACM Conference on Computer and Communications Security, 2012.

[82] L. Ablon and A. Bogart, "Zero Days, Thousands of Nights: The Life and Times of Zero-Day Vulnerabilities and Their Exploits," 2017.

[83] S. Gordon, "Siem best practices to work," 2010. [Online]. Available: https://www.eslared.org.ve/walc2012/material/track4/Monitoreo/ Top 10 SIEM Best Practices.pdf.

[84] K. Kent and M. Souppaya, "NIST Guide to Computer Security Log Management," 2016. [Online]. Available: https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-92.pdf.

[85] "Amazon S3 Pricing," Inc. Amazon Web Services, [Online]. Available: https://aws.amazon.com/s3/pricing/.

[86] L. Rizzo, "Effective erasure codes for reliable computer communication protocols," 1997.

[87] A. Bessani, M. Correia, B. Quaresma, F. Andre and P. Sousa, "DepSky: Dependable and secure storage in a cloud-of-clouds," ACM Transactions on Storage, 2013.

[88] "ArcSight SmartConnector supported products," L.P. Hewlett-Packard Development Company, 2013. [Online]. Available: https://www.binss.de/wp-content/uploads/HP-ArcSight-SmartConnectors-Supported-Products.pdf.

[89] "Usm appliance deployment guide," Inc. Alien Vault, 2017. [Online]. Available: https://www.alienvault.com/documentation/resources/pdf/usm-appliance-deployment-guide.pdf.

[90] "Getting Data In," Inc. Splunk, 2017. [Online]. Available: https://docs.splunk.com

Page 60: D6.1 Preliminary Architecture and Service Model of ...disiem-project.eu/wp-content/uploads/2018/06/D6.1v2.pdf · Package 6, this report contains partial results for Tasks 6.1, 6.2

D6.1

60

/Documentation/Splunk/6.0.3/Data/WhatSplunkcanmonitor. [91] C. Gormley and Z. Tong, "Elasticsearch: The Definitive Guide: A Distributed Real-

Time Search and Analytics Engine," O'Reilly Media Inc., 2015. [92] "Hpe securoty arcsight esm," L.P. Hewlett-Packard Development Company,

2017. [Online]. [93] "Usm appliance deployment guide," Inc. Alien Vault, 2017. [Online]. Available:

https://www.splunk.com/pdfs/white-papers/splunk-enterprise-on- aws-deployment-guidelines.pdf.

[94] "Splunk Cloud," Inc. Splunk, 2017. [Online]. Available: https://www.splunk.com/en_us/products/splunk-cloud.html.

[95] "Elastic Cloud," Elastic, 2017. [Online]. Available: https://www.elastic.co/cloud . [96] T. Oliveira, R. Mendes and A. Bessani, "Exploring key-value stores in multi-writer

byzantine-resilient register emulations," in 20th International Conference On Principled of Distributed Systems, 2016.

[97] D. Malkhi and M. Reiter, "Byzantine quorum systems," Distributed Computing, 1998.

[98] B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, 1970.

[99] "Apache Lucene 6.6.0 documentation," The Apache Software Foundation, 2017. [Online]. Available: http://lucene.apache.org/core/6 6 0/.