Selem Trabelsi & Walha Amal
-
Upload
salem-trabelsi -
Category
Documents
-
view
236 -
download
0
Transcript of Selem Trabelsi & Walha Amal
-
7/29/2019 Selem Trabelsi & Walha Amal
1/25
A novel intrusion detection system based on
hierarchical clustering and
support vector machines
Prsent par:
proposed by :Mr Abbes Tarek
-
7/29/2019 Selem Trabelsi & Walha Amal
2/25
PLAN
-
7/29/2019 Selem Trabelsi & Walha Amal
3/25
Introduction
-
7/29/2019 Selem Trabelsi & Walha Amal
4/25
NIDS is a network intrusion detection system that attempts to detect
malicious activitiesA problem of NIDS is that detects only known network attacks.
Many methods proposed on the design of NIDS such as:
Introduction
1
This study proposes a SVM-based intrusion detection system based on ahierarchical clustering algorithm to preprocess the KDD Cup 1999
dataset before SVM training.
decision tree based on C5 used by the KDD Cup 1999
Fuzzy rough C-means (FRCM) based on the fuzzy set theory.
support vector machine (SVM)
-
7/29/2019 Selem Trabelsi & Walha Amal
5/25
-
7/29/2019 Selem Trabelsi & Walha Amal
6/25
Designed for very large data sets
Time and memory are limitedIncremental and dynamic clustering of incoming objectsOnly one scan of data is necessaryConstructs a tree called a clustering feature (CF) treeDoes not need the whole data set in advanceAble to handle noise effectively.
BIRCH hierarchical clustering alghorithm
Two key phases:Scans the database to build an in-memory treeApplies clustering algorithm to cluster the leaf nodes
2
-
7/29/2019 Selem Trabelsi & Walha Amal
7/25
Clustering feature (CF)Node in the CF tree composed of clustering feature
A CF is a triplet summarizes the information of a cluster.
Given a cluster of instances we define:
Centroid:
Radius:
Euclidean distance :
Theorem to merge sub-clusters:
CF1 + CF2= (n1+n2, LS1+LS2, SS1+SS2)
NP
i
NP
i
i
i
PSS
PLS
2
CF = (n, LS, SS)
3
-
7/29/2019 Selem Trabelsi & Walha Amal
8/25
4
A CF tree is a compact representation of a dataset
The insertion procedure of CF tree has three steps.
leaf node represents a cluster that absorbs many data point and must
satisfy the threshold requirement T.
Each non-leaf node contains B entries of the form (CFi, child i )
-
7/29/2019 Selem Trabelsi & Walha Amal
9/25
5
Step 1: Identify the appropriate leaf
It starts from the root and traverses the CF tree recursively down to the
leaf level by choosing the child node, whose centroid is closest at each
level to the new entry.
The algorithm computes the distances between CFx and each entry
-
7/29/2019 Selem Trabelsi & Walha Amal
10/25
6
three possible options:
leaf absorbs the new entry without violating the radius threshold T
condition. add a new leaf entry for the new entry.
If adding a new entry violates the branching factor B threshold the
leaf node has to split by choosing the farthest pair of entries as seeds.
Step 2: Modify the leaf
-
7/29/2019 Selem Trabelsi & Walha Amal
11/25
7
Step 3: Modify entries on the path to the leaf
oAfter inserting the new entry into a leaf node, the algorithm
needs to update the CF information for each non-leaf entry alongthe path backwards to the root.
oIf no leaf node splitting is invoked, the algorithm simply adds
the CF to reflect the addition of the new entry. If leaf nodesplitting is invoked, the algorithm checks whether the parent node
meets the branching factor constraint.
o
If the parent node violates the B threshold, it is split andrecursively traversed back to the root, while performing the
same checks.
-
7/29/2019 Selem Trabelsi & Walha Amal
12/25
-
7/29/2019 Selem Trabelsi & Walha Amal
13/25
A SVM is a supervised learning method. It performs classification by
constructing an N -dimensional hyperplane that optimally separates the data
into different categories.
Support vector machines (SVM)
8
The training complexity is very dependent on the size of the dataset
SVM have shown good results in data classification , but it is unable tooperate at such a large dataset due to system failures caused by
insufficient memory
-
7/29/2019 Selem Trabelsi & Walha Amal
14/25
Is a combination of support vector machine and hierarchical clustering to
build an intrusion detection system.The BIRCH algorithm has to transform dataset to a smaller sized and used
to produce a reduced and high quality dataset before SVM training.
The process of the proposed system is described as follows:
(1) Transform and scale data
(2) Construct CF trees for attacks and normal packets
(3) Do feature selection for each type of attacks.
(4) Train four SVM classifiers by the centroid of all entries in leaf nodes of
CF trees.(5) Combine the four SVMs classifiers to build an intrusion detection system
9
SVM with hierarchical clustering
f d l
-
7/29/2019 Selem Trabelsi & Walha Amal
15/25
SVM requires each data point to be represented as a vector of real numbers.
every non-numerical attribute has to be transformed into numerical data first.
For example, the protocol_type attribute in KDD Cup 1999, tcp is changed
with 0, udp with 1, and icmp with 2.
Data transformation and scaling
Data scaling can avoid attributes with greater values dominating those
attributes with smaller values.
each attribute is called with linear scaling to the range of [0, 1] by
dividing every attribute value by its own maximum value.
10
i
-
7/29/2019 Selem Trabelsi & Walha Amal
16/25
CF trees construction
The CF trees can be constructed with a single scan of the dataset.
four kinds of attacks in KDD Cup:
DoS: Denial of ServiceR2L: illegitimate access from a remote machine
U2R: Acquire the privileges of a super user
Probing: scan of port
One CF tree for normal taffic
In the KDD Cup 1999 data set there are 41 features
LS and SS are contain the sum and square sum value of each of the features 10
l i
-
7/29/2019 Selem Trabelsi & Walha Amal
17/25
Feature selection
11
not all features are needed in the design of a network intrusion
detection system.
It is critical to identify important features of networktraffic
data.
a feature is important is determined based on the accuracy and
the number of false positives of the system, with and without the
feature.
-
7/29/2019 Selem Trabelsi & Walha Amal
18/25
-
7/29/2019 Selem Trabelsi & Walha Amal
19/25
12
The datasets contained 24 training attack types classified into
four kinds of attacks, DoS, U2R, R2L, and Probe.
The best performance of this system in terms of accuracy was 95.72%
with only a 0.73% false positive rate.
this system showed superior performance in DoS and Probe attacks and
suffered from both U2R and R2L attacks because the numbers of
instances for these two attacks were too small in the original KDD Cup
1999 dataset.
-
7/29/2019 Selem Trabelsi & Walha Amal
20/25
13
This table presents the original numbers of instance for each kind of attack
and the numbers of instances by CF trees with different threshold T
The hierarchical clustering reduces the number of instances for datasets.For exemple with a CF tree (T=0.2) the number of instances of Dos is very
small compared to original data set (only 271)
-
7/29/2019 Selem Trabelsi & Walha Amal
21/25
14
This table compares the KDD Cup 1999 winners system and other
researches, this system provided the best detection rate for DoS andProbe attacks.
The ESC-IDS showed the best detection rate for the R2L attack, and
Multi-classifier showed the best detection rate for the U2R attack.
in terms of accuracy, this system could achieve the best performance.
-
7/29/2019 Selem Trabelsi & Walha Amal
22/25
15
As shown in Table , the
IDS detection rate of this
study for new attacks is
only 39.04%, and theworst detection is on
new R2L attacks.
-
7/29/2019 Selem Trabelsi & Walha Amal
23/25
-
7/29/2019 Selem Trabelsi & Walha Amal
24/25
CONCLUSIONET
PERSPECTIVES
Many researches concerning NIDSs applied SVMs because SVMs are wellknown for their generalization performances.
this study proposed an SVM-based network intrusion detection system with
BIRCH hierarchical clustering for data preprocessing.
The BIRCH hierarchical clustering could provide highly qualified, abstracted
and reduced datasets to the SVM training
the resultant SVM classifiers showed better performance than the SVM
classifiers using the originally redundant dataset.
16
-
7/29/2019 Selem Trabelsi & Walha Amal
25/25