Selem Trabelsi & Walha Amal

7/29/2019 Selem Trabelsi & Walha Amal

1/25

A novel intrusion detection system based on

hierarchical clustering and

support vector machines

Prsent par:

proposed by :Mr Abbes Tarek


2/25

PLAN


3/25

Introduction


4/25

NIDS is a network intrusion detection system that attempts to detect

malicious activitiesA problem of NIDS is that detects only known network attacks.

Many methods proposed on the design of NIDS such as:

Introduction

1

This study proposes a SVM-based intrusion detection system based on ahierarchical clustering algorithm to preprocess the KDD Cup 1999

dataset before SVM training.

decision tree based on C5 used by the KDD Cup 1999

Fuzzy rough C-means (FRCM) based on the fuzzy set theory.

support vector machine (SVM)


5/25


6/25

Designed for very large data sets

Time and memory are limitedIncremental and dynamic clustering of incoming objectsOnly one scan of data is necessaryConstructs a tree called a clustering feature (CF) treeDoes not need the whole data set in advanceAble to handle noise effectively.

BIRCH hierarchical clustering alghorithm

Two key phases:Scans the database to build an in-memory treeApplies clustering algorithm to cluster the leaf nodes

2


7/25

Clustering feature (CF)Node in the CF tree composed of clustering feature

A CF is a triplet summarizes the information of a cluster.

Given a cluster of instances we define:

Centroid:

Radius:

Euclidean distance :

Theorem to merge sub-clusters:

CF1 + CF2= (n1+n2, LS1+LS2, SS1+SS2)

NP

i

NP

i

i

i

PSS

PLS

2

CF = (n, LS, SS)

3


8/25

4

A CF tree is a compact representation of a dataset

The insertion procedure of CF tree has three steps.

leaf node represents a cluster that absorbs many data point and must

satisfy the threshold requirement T.

Each non-leaf node contains B entries of the form (CFi, child i )


9/25

5

Step 1: Identify the appropriate leaf

It starts from the root and traverses the CF tree recursively down to the

leaf level by choosing the child node, whose centroid is closest at each

level to the new entry.

The algorithm computes the distances between CFx and each entry


10/25

6

three possible options:

leaf absorbs the new entry without violating the radius threshold T

condition. add a new leaf entry for the new entry.

If adding a new entry violates the branching factor B threshold the

leaf node has to split by choosing the farthest pair of entries as seeds.

Step 2: Modify the leaf


11/25

7

Step 3: Modify entries on the path to the leaf

oAfter inserting the new entry into a leaf node, the algorithm

needs to update the CF information for each non-leaf entry alongthe path backwards to the root.

oIf no leaf node splitting is invoked, the algorithm simply adds

the CF to reflect the addition of the new entry. If leaf nodesplitting is invoked, the algorithm checks whether the parent node

meets the branching factor constraint.

o

If the parent node violates the B threshold, it is split andrecursively traversed back to the root, while performing the

same checks.


12/25


13/25

A SVM is a supervised learning method. It performs classification by

constructing an N -dimensional hyperplane that optimally separates the data

into different categories.

Support vector machines (SVM)

8

The training complexity is very dependent on the size of the dataset

SVM have shown good results in data classification , but it is unable tooperate at such a large dataset due to system failures caused by

insufficient memory


14/25

Is a combination of support vector machine and hierarchical clustering to

build an intrusion detection system.The BIRCH algorithm has to transform dataset to a smaller sized and used

to produce a reduced and high quality dataset before SVM training.

The process of the proposed system is described as follows:

(1) Transform and scale data

(2) Construct CF trees for attacks and normal packets

(3) Do feature selection for each type of attacks.

(4) Train four SVM classifiers by the centroid of all entries in leaf nodes of

CF trees.(5) Combine the four SVMs classifiers to build an intrusion detection system

9

SVM with hierarchical clustering

f d l


15/25

SVM requires each data point to be represented as a vector of real numbers.

every non-numerical attribute has to be transformed into numerical data first.

For example, the protocol_type attribute in KDD Cup 1999, tcp is changed

with 0, udp with 1, and icmp with 2.

Data transformation and scaling

Data scaling can avoid attributes with greater values dominating those

attributes with smaller values.

each attribute is called with linear scaling to the range of [0, 1] by

dividing every attribute value by its own maximum value.

10

i


16/25

CF trees construction

The CF trees can be constructed with a single scan of the dataset.

four kinds of attacks in KDD Cup:

DoS: Denial of ServiceR2L: illegitimate access from a remote machine

U2R: Acquire the privileges of a super user

Probing: scan of port

One CF tree for normal taffic

In the KDD Cup 1999 data set there are 41 features

LS and SS are contain the sum and square sum value of each of the features 10

l i


17/25

Feature selection

11

not all features are needed in the design of a network intrusion

detection system.

It is critical to identify important features of networktraffic

data.

a feature is important is determined based on the accuracy and

the number of false positives of the system, with and without the

feature.


18/25


19/25

12

The datasets contained 24 training attack types classified into

four kinds of attacks, DoS, U2R, R2L, and Probe.

The best performance of this system in terms of accuracy was 95.72%

with only a 0.73% false positive rate.

this system showed superior performance in DoS and Probe attacks and

suffered from both U2R and R2L attacks because the numbers of

instances for these two attacks were too small in the original KDD Cup

1999 dataset.


20/25

13

This table presents the original numbers of instance for each kind of attack

and the numbers of instances by CF trees with different threshold T

The hierarchical clustering reduces the number of instances for datasets.For exemple with a CF tree (T=0.2) the number of instances of Dos is very

small compared to original data set (only 271)


21/25

14

This table compares the KDD Cup 1999 winners system and other

researches, this system provided the best detection rate for DoS andProbe attacks.

The ESC-IDS showed the best detection rate for the R2L attack, and

Multi-classifier showed the best detection rate for the U2R attack.

in terms of accuracy, this system could achieve the best performance.


22/25

15

As shown in Table , the

IDS detection rate of this

study for new attacks is

only 39.04%, and theworst detection is on

new R2L attacks.


23/25


24/25

CONCLUSIONET

PERSPECTIVES

Many researches concerning NIDSs applied SVMs because SVMs are wellknown for their generalization performances.

this study proposed an SVM-based network intrusion detection system with

BIRCH hierarchical clustering for data preprocessing.

The BIRCH hierarchical clustering could provide highly qualified, abstracted

and reduced datasets to the SVM training

the resultant SVM classifiers showed better performance than the SVM

classifiers using the originally redundant dataset.

16


25/25

Selem Trabelsi & Walha Amal

Documents

Transcript of Selem Trabelsi & Walha Amal