Analysis and Design for Intrusion Detection System IDS Using Data Mining IEEE 2010
Analysis and Design for Intrusion Detection System
Based on Data Mining
Duanyang Zhao, Qingxiang Xu, Zhilin Feng
Zhijiang College of Zhejiang University of Technology
Hangzhou, Zhejiang Province, 310024, China
{sunny, xqx, fengzl}@zjc.zjut.edu.cn
Abstract: Network and host Intrusion Detection Systems (IDS) have become a standard component of security infrastructures. Because intrusions are variable, complicated, and uncertain in character, intrusion detection still faces many unresolved problems. Host-based and network-based approaches each have strengths and weaknesses, and a truly effective intrusion detection system will employ both technologies. We discuss the differences between host-based and network-based intrusion detection techniques to demonstrate how the two can work together to provide more effective intrusion detection and protection. We propose a hybrid IDS that combines network and host IDS with anomaly and misuse detection modes, uses auditing programs to extract an extensive set of features describing each network connection or host session, and applies data mining programs to learn rules that accurately capture the behavior of intrusions and normal activities.
Keywords-intrusion detection; hybrid ids; data mining; analysis
engine; apriori algorithm
I. INTRODUCTION
The Apriori algorithm in data mining finds attribute values that frequently appear together in a given data set. It can mine the relationships between attribute values in a database table, which makes it a suitable method for intrusion detection systems.
The most representative research in this field is that of Wenke Lee's research group at Columbia University [1][2], begun in 1998. The group was funded by the Defense Advanced Research Projects Agency (DARPA) and the National Science Foundation (NSF) and focused on research in this area. Since then, the IDS research group of the Department of Computer Science, led by Professor Salvatore J. Stolfo, has carried out extensive studies on data mining-based IDS, dividing its research into twelve sub-topics. Their work is among the leading research worldwide. SANS (SysAdmin, Audit, Network, Security) has also done outstanding work in this area [3].
In recent years, the Chinese Academy of Sciences (CAS) and key universities and colleges in China have been actively carrying out research in this area [4][5]. With support from the Development Project of National Key Basic Research and the Major Projects Fund of the CAS Knowledge Innovation Project, Dr. Xu Jing of the Computing Center, Research Institute of High Energy Physics of CAS, made a preliminary implementation of an intrusion detection system based on data mining. With support from the National Natural Science Fund, doctoral students in the Departments of Computer Science at Nanjing University of Science, Wuhan University, Northern Jiaotong University, and other key universities have carried out similar research.
Analysis of backdoor programs, through which hackers control target hosts, shows that they can produce unexpected connection records on the network. Because a network processes a huge amount of data, the number of connection records that remain after filtering is still very large, and every newly established connection adds another record. Therefore, we cannot detect intrusions simply by comparing connection records.
In recent years, the use of data mining for intrusion detection systems has attracted more and more attention, but many problems remain. For example, it is difficult to set a clear standard for selecting test data, the mining results for the experimental data contain large amounts of useless information, and it is unclear how the rules mined from the experimental data should be expressed for the intrusion detection system.
The remainder of this paper is organized as follows. The second section describes the framework of the hybrid intrusion detection system. The third section presents the experimental design and the results of the Apriori algorithm in data mining. Finally, we draw conclusions and outline future work.
II. THE FRAMEWORK OF HYBRID IDS
Intrusion detection technology is a new security support mechanism that monitors the network system without affecting network performance, in order to prevent internal and external attacks and misuse. Intrusion detection systems can be classified in several ways. According to the object being monitored, they are divided into host-based, network-based, and hybrid IDS; according to system architecture, into centralized and distributed IDS; and according to detection type, into anomaly-based and misuse-based IDS.
The hybrid IDS in this paper combines misuse and anomaly detection engines, uses data mining algorithms to process vast amounts of security audit data, and generates detection models and test models separately from network data and host system calls, as shown in Fig. 1:
2010 Second International Workshop on Education Technology and Computer Science
978-0-7695-3987-4/10 $26.00 2010 IEEE
DOI 10.1109/ETCS.2010.478
Figure 1. The hybrid IDS based on data mining algorithms
The hybrid IDS consists of four parts: data warehouse, sensors, analysis engine, and alarm system.
A. Data warehouse
Data warehouse technology provides a subject-oriented, integrated, and time-related collection of data that supports the decision-making process, and it supports multi-process and multi-threading technology. Many commercial DBMSs offer these functions. The project uses the data warehouse technology of SQL Server 2005, which includes Analysis Services. It makes it easy to set up a data warehouse and achieve distributed computing, provides OLE DB controls and ADO (ActiveX Data Objects) technology, and has a flexible data model. These features can improve the speed of data mining and of the analysis engine.
Data warehouse technology is beneficial in that different components can asynchronously handle the same piece of data stored in the database. The data warehouse is therefore the central store for the data and models of the whole system.
B. Sensors
Sensors are closely tied to the operating system, which is usually Windows or UNIX/Linux. This paper uses Windows as the example for describing sensor techniques.
1) Host Sensors
They gather information on monitored hosts from a variety of sources, such as application logs, security logs, event logs, running applications, and registry changes.
Once audit features are set up, Windows Server monitors various states of the system and writes them to logs. Using Windows API functions, we develop programs that monitor the system logs, running applications, and registry changes and send them to the Host Sensor Manager of the analysis engine for analysis.
We use hook functions to intercept API calls. The hook is an important part of the Windows message processing mechanism. By installing hooks, an application can set up subroutines that monitor system messaging. Before messages reach their destinations, the subroutines intercept them and analyze them according to user requirements. Hooks are divided into thread-specific hooks and global hooks. Thread-specific hooks monitor a specified thread, and global hooks monitor all threads in the system. For global hooks, the hook functions must reside in a separate dynamic-link library (DLL) so that they can be called from the associated applications.
A hook function is a mechanism that lets an application monitor message flows and process certain types of messages before they reach their destination window. For example:
A process can install a WH_GETMESSAGE hook to check each window message in the system. It installs the hook by calling the SetWindowsHookEx function as follows:
HHOOK hHook = SetWindowsHookEx(WH_GETMESSAGE, GetMsgProc, hinstDLL, 0);
Here WH_GETMESSAGE indicates the type of hook to be installed, GetMsgProc is the address of the hook procedure that the system calls when a window retrieves a message, hinstDLL is the handle of the DLL that contains the GetMsgProc function, and the final argument, 0, is the thread identifier, where zero requests a global hook rather than a thread-specific one.
2) Network Sensors
Network sensors use the netstat tool of Windows to collect information about the network connections established between computers. The netstat command can list all open ports on a computer. We could design a program that runs netstat at regular intervals and outputs the results, but this approach adds to the load on the system: on a relatively busy system, one day of records may reach several GB in size.
Therefore, we optimize the program so that it captures network connection information more cheaply. It first lists all open ports, then monitors whether each port is newly opened and when it is closed, records only the port information that has changed, and outputs the records to the Network Sensor Manager in the Analysis Engine. The records include port service, port number, activation time, timestamp, and so on.
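The change-only recording described above can be sketched as a comparison of two successive port snapshots. This is a minimal illustration in Python, not the authors' implementation: the `PortEvent` record, the `diff_ports` function, and the snapshot format (a mapping from port number to service name) are assumptions made for the example, and collecting the snapshots themselves (for instance by parsing netstat output) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortEvent:
    """One change record sent to the Network Sensor Manager (hypothetical layout)."""
    port: int
    service: str
    event: str        # "opened" or "closed"
    timestamp: float


def diff_ports(previous, current, timestamp):
    """Compare two {port: service} snapshots and report only the changes."""
    events = []
    # Ports present now but not in the previous snapshot were newly opened.
    for port in sorted(current.keys() - previous.keys()):
        events.append(PortEvent(port, current[port], "opened", timestamp))
    # Ports present before but missing now were closed.
    for port in sorted(previous.keys() - current.keys()):
        events.append(PortEvent(port, previous[port], "closed", timestamp))
    return events


# Two hypothetical snapshots taken one polling interval apart.
before = {80: "http", 25: "smtp"}
after = {80: "http", 4444: "unknown"}
events = diff_ports(before, after, timestamp=1000.0)
for e in events:
    print(e.port, e.service, e.event)
```

Because unchanged ports (port 80 here) produce no record, the output volume depends on how often ports open and close rather than on the polling frequency.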
[Figure 1 block diagram. Labeled components: Data Warehouse; Host Sensor and Network Sensor groups (Sensor-1, Sensor-2, ..., Sensor-m and Sensor-1, Sensor-2, ..., Sensor-n); Analysis Engine (Network Sensor Manager, Host Sensor Manager, Misuse Detector, Anomaly Detector, Pattern Mining, Mining Algorithm Library); Alarm Message (two links); Alarm System (Alarm Manager, Alarm Strategy, Intruder Tracing, System Protection Strategy, Archive Information).]
C. Analysis Engine
The Analysis Engine consists of three parts: the Network/Host Sensor Managers, the Misuse and Anomaly Detectors, and the Mining Algorithm Library with Pattern Mining.
1) The Sensor Manager receives data from the sensors, analyzes the data, translates them into database records, and stores them in the data warehouse.
2) Misuse and Anomaly Detection detects intrusions based
on the matching patterns stored in the data warehouse.
Traditional IDS are divided into two separate types: misuse detection and anomaly detection. Anomaly detection, also known as behavior-based detection, builds behavioral models of users under normal circumstances during a learning phase, then compares current user behavior with the existing behavioral models and reports an intrusion if the deviation exceeds a confidence threshold. The basic principle is that any behavior inconsistent with the known normal behaviors is treated as an intrusion.
Misuse detection, also called knowledge-based intrusion detection, builds intrusion patterns for known intrusions, then matches current user behavior and system status against the existing intrusion patterns. The basic principle is that any behavior consistent with the known intrusion behaviors is treated as an intrusion.
We integrate these two models into the hybrid IDS, thus forming new basic principles of intrusion detection: a behavior is normal if it is consistent with the normal behavior model; a behavior is an intrusion if it is consistent with the anomaly behavior model; and all other behaviors are passed to the Pattern Mining module, which uses the Mining Algorithm Library to generate a new detection model that is added to the detection models in the data warehouse. When comparing an unknown behavior with the normal/anomaly behavior models, the detectors decide whether it is normal or anomalous by comparing the computed support and confidence with the given minimum support and minimum confidence.
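The three-way decision of the hybrid detector can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the `Rule` structure, the representation of a behavior as a set of attribute values, the threshold defaults, and the choice to test the intrusion model before the normal model are all hypothetical.

```python
from typing import FrozenSet, NamedTuple


class Rule(NamedTuple):
    """One pattern in a behavior model (hypothetical structure)."""
    items: FrozenSet[str]
    support: float
    confidence: float


def matches(behavior, rules, min_support, min_confidence):
    """A behavior matches a model if it contains all items of some rule
    whose support and confidence reach the given minimums."""
    return any(
        rule.items <= behavior
        and rule.support >= min_support
        and rule.confidence >= min_confidence
        for rule in rules
    )


def classify(behavior, normal_model, intrusion_model,
             min_support=0.05, min_confidence=0.9):
    # Assumption: the intrusion model is consulted first, so a behavior
    # that matches both models is reported as an intrusion.
    if matches(behavior, intrusion_model, min_support, min_confidence):
        return "intrusion"
    if matches(behavior, normal_model, min_support, min_confidence):
        return "normal"
    # Neither model applies: hand the behavior to pattern mining.
    return "unknown"


normal_model = [Rule(frozenset({"80", "sf"}), 0.23, 1.0)]
intrusion_model = [Rule(frozenset({"4444", "rej"}), 0.08, 0.95)]

print(classify(frozenset({"tcp", "80", "sf"}), normal_model, intrusion_model))
print(classify(frozenset({"tcp", "4444", "rej"}), normal_model, intrusion_model))
print(classify(frozenset({"udp", "53"}), normal_model, intrusion_model))
```

Behaviors classified as "unknown" are the ones that would be forwarded to the Pattern Mining module to extend the detection models.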
3) Mining Algorithm Library and Pattern Mining for
mining unknown intrusions.
From the point of view of the data warehouse, data mining can be regarded as an advanced stage of online analytical processing (OLAP). We apply data mining technology to IDS, using its association analysis and sequential pattern analysis algorithms to extract security-related features, generate classification models based on them, and identify security incidents automatically. The analytical methods of data mining can be divided into three parts:
a) Association analysis
Its purpose is to uncover hidden relationships among the data. Based on the correlation among a set of items, association analysis can be used to identify correlations between intrusion behaviors.
The basic definitions of association analysis are as follows:
Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of binary attributes whose elements are called items. Let $D$ be a collection of transactions $T$, where each transaction is a set of items with $T \subseteq I$. Let $X$ be a set of items in $I$; if $X \subseteq T$, then transaction $T$ contains $X$.
An associational rule is an implication of the form $X \Rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$, and $X \cap Y = \emptyset$. The support of rule $X \Rightarrow Y$ in the transaction set $D$ is the ratio of the number of transactions that contain both $X$ and $Y$ to the total number of transactions, denoted by $\mathrm{Support}(X \Rightarrow Y)$, that is:
$$\mathrm{Support}(X \Rightarrow Y) = \frac{|\{T : X \cup Y \subseteq T,\ T \in D\}|}{|D|}$$
The confidence of rule $X \Rightarrow Y$ in the transaction set $D$ is the ratio of the number of transactions that contain both $X$ and $Y$ to the number of transactions that contain $X$, denoted by $\mathrm{Confidence}(X \Rightarrow Y)$, that is:
$$\mathrm{Confidence}(X \Rightarrow Y) = \frac{|\{T : X \cup Y \subseteq T,\ T \in D\}|}{|\{T : X \subseteq T,\ T \in D\}|}$$
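As a small worked example of the two measures, the following Python sketch computes support and confidence over a toy transaction set. The function names and the four sample connection records are invented for illustration and are not taken from the paper's data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, x, y):
    """Among the transactions containing X, the fraction that also contain Y."""
    x = frozenset(x)
    both = x | frozenset(y)
    n_x = sum(1 for t in transactions if x <= t)
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_x if n_x else 0.0


# Four hypothetical connection records, each a set of attribute values.
transactions = [
    frozenset({"80", "sf", "normal"}),
    frozenset({"80", "sf", "normal"}),
    frozenset({"80", "rej", "attack"}),
    frozenset({"25", "sf", "normal"}),
]

# Rule {80, sf} => {normal}
rule_support = support(transactions, {"80", "sf", "normal"})
rule_confidence = confidence(transactions, {"80", "sf"}, {"normal"})
print(rule_support, rule_confidence)  # prints 0.5 1.0
```

Two of the four records contain all three items, which gives a support of 0.5, and both records that contain {80, sf} are also labeled normal, which gives a confidence of 1.0.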
Given a transaction set D, the task of association analysis is to find the associational rules whose support and confidence are at least the user-specified minimum support (minsupp) and minimum confidence (minconf), respectively.
Agrawal et al. formulated this problem in 1993 and designed a basic algorithm for it (Apriori). In recent years the algorithm has been improved considerably. The project applies the latest algorithms for pattern mining.
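For orientation, the level-wise search of the basic Apriori algorithm can be sketched in a few lines of Python. This is a textbook-style sketch for small inputs, not one of the improved variants the project refers to; the function name `apriori` and the sample transactions are assumptions made for the example, and rule generation against minconf is omitted.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(transactions)

    def frequent(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Level 1: frequent single items.
    level = frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result


# Four hypothetical connection records.
transactions = [
    frozenset({"tcp", "80", "sf"}),
    frozenset({"tcp", "80"}),
    frozenset({"tcp", "sf"}),
    frozenset({"80", "sf"}),
]
frequent_itemsets = apriori(transactions, min_support=0.5)
for itemset, s in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a minimum support of 0.5, all single items and all pairs are frequent, while the triple {tcp, 80, sf} appears in only one of four records and is discarded.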
b) Sequence pattern analysis
Like association analysis, its purpose is to uncover relationships among the data, but it focuses on the order in which events occur. Many hacker intrusion behaviors are ordered, with some actions necessarily occurring after others. For example, a hacker generally scans the system's ports before attacking.
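The ordering idea can be illustrated with a very small check of whether a known sequence of steps appears, in order, within an event stream. The function name `occurs_in_order` and the event labels are hypothetical; real sequential pattern mining would discover such patterns from data rather than test a fixed one.

```python
def occurs_in_order(events, pattern):
    """True if the steps of the pattern appear in the event list in the
    given order, not necessarily adjacent to each other."""
    remaining = iter(events)
    # Each membership test consumes the iterator up to the matched step,
    # so later steps can only match later events.
    return all(step in remaining for step in pattern)


# Hypothetical event stream for one source host.
events = ["login", "port_scan", "idle", "exploit"]

print(occurs_in_order(events, ["port_scan", "exploit"]))
print(occurs_in_order(events, ["exploit", "port_scan"]))
```

The first call reports a match because the scan precedes the exploit, while the second does not, since the stream never contains a scan after the exploit.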
c) Classification analysis
Assume a collection of records and a set of tags, where each tag denotes a category with distinct characteristics. We assign a tag to each record, that is, we classify the records by tag. We then examine the tagged records and describe their characteristics. For example, intrusions may be divided into three categories according to the level of harm: fatal, general, and weak intrusions. Classification analysis examines previous attacks, classifies each by risk level, and then describes each class according to the classification criteria.
The Bayesian classification algorithm is as follows:
Each connection record is described by an n-dimensional feature vector $X = (x_1, x_2, \ldots, x_n)$, whose components are the values of the n attributes of the connection record.
Assume that there are m categories $C_1, C_2, \ldots, C_m$. Given an unknown (untagged) connection record $X$, the classifier predicts that $X$ belongs to the category with the highest posterior probability; that is, the Bayesian classifier assigns the unknown connection record $X$ to category $C_i$ if and only if $P(C_i \mid X) > P(C_j \mid X)$ for $1 \le j \le m$, $j \ne i$. According to Bayes' theorem, $P(C_i \mid X) = P(X \mid C_i) P(C_i) / P(X)$.
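A minimal Python sketch of this decision rule is given here for illustration. It assumes categorical attributes and, as naive Bayes classifiers do, that attributes are conditionally independent given the category, so that $P(X \mid C_i)$ is estimated as a product of per-attribute frequencies. The function names and the five training records are invented, and no smoothing is applied, so an unseen attribute value drives a category's score to zero.

```python
from collections import Counter, defaultdict


def train(records, labels):
    """Count records per category and attribute values per (category, position)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)
    for x, c in zip(records, labels):
        for k, v in enumerate(x):
            value_counts[(c, k)][v] += 1
    return class_counts, value_counts


def classify(x, class_counts, value_counts):
    """Return the category that maximizes P(X | C) * P(C)."""
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total                           # prior P(C)
        for k, v in enumerate(x):
            score *= value_counts[(c, k)][v] / n_c    # P(x_k | C)
        if score > best_score:
            best, best_score = c, score
    return best


# Hypothetical training data: (protocol, port, flag) with a category tag.
records = [
    ("tcp", "80", "sf"),
    ("tcp", "80", "sf"),
    ("tcp", "25", "sf"),
    ("tcp", "80", "rej"),
    ("udp", "53", "rej"),
]
labels = ["normal", "normal", "normal", "attack", "attack"]
class_counts, value_counts = train(records, labels)

print(classify(("tcp", "80", "sf"), class_counts, value_counts))
print(classify(("udp", "53", "rej"), class_counts, value_counts))
```

$P(X)$ is not computed at all, because it is the same for every category and does not change which category has the largest score.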
Since $P(X)$ is constant for all categories, we only need to maximize $P(X \mid C_i) P(C_i)$. The prior probability of a category is $P(C_i) = s_i / s$, where $s_i$ is the number of connection records in category $C_i$ and $s$ is the total number of connection records.
To reduce the overhead of calculating $P(X \mid C_i)$, we assume that the attributes are conditionally independent given the category, so that $P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$, where $P(x_k \mid C_i) = s_{ik} / s_i$, $s_{ik}$ is the number of connection records in category $C_i$ whose k-th attribute has the value $x_k$, and $s_i$ is the number of connection records in category $C_i$.
To classify an unknown connection, we calculate $P(X \mid C_i) P(C_i)$ for each category $C_i$ and assign the connection record $X$ to category $C_i$ if and only if $P(X \mid C_i) P(C_i) > P(X \mid C_j) P(C_j)$ for $1 \le j \le m$, $j \ne i$.
Although these algorithms suit different scenarios, we use them in combination in this project to improve the performance of the IDS.
D. Alarm system
The main function of the alarm system is to take emergency measures based on alarm strategies, such as appropriate system protection, archiving, and intrusion tracing when necessary. The details are omitted here.
III. THE EXPERIMENT FOR ASSOCIATION ANALYSIS
A. The design of associational rule detector
Association analysis in data mining is divided into two parts: the learning phase of the rules and the detecting phase, in which the learned rules are applied.
1) In the learning phase, the Analysis Engine applies association analysis to connection records from the Network Sensor Manager to mine the associations between the values of data items under the normal state of the network and obtain the associational rule set, which is then filtered by manually defined rules so that it is suitable as the detection rule set for detecting intrusions.
2) In the detecting phase, the Analysis Engine gets connection records from the Network Sensor Manager and matches them against the detection rule set to determine whether an intrusion has taken place. This matching against the detection rules constitutes the association analysis of the detecting phase.
The detection rule set produced in the learning phase is the core of the Analysis Engine. Its validity has a direct impact on the accuracy of detection results.
B. The experimental results for associational rules
Our experimental data come from a file of network packets recorded by the TCPdump tool. We convert the network packets into connection records and save them as a text file in which the variables are separated by spaces.
Association analysis builds the rule sets from the connection records, with the minimum support set to 5% and the minimum confidence set to 100%. However, the rule sets contain a large number of useless rules that do not express meaningful associations between the values of connection attributes. If we used them as the standard for monitoring network intrusions, the system would make misdetections. Therefore, we have to remove the useless rules, as follows:
First, filter out the rules whose right items do not contain a category;
Then, filter out the rules whose left items do not contain the IP and ports.
After these filtering steps, the final associational rules are closer to ideal. A few of the associational rules are shown in Table I.
TABLE I. ASSOCIATIONAL RULES
The left items of rule | The right items of rule | Support (%) | Confidence (%)
192.168.7.13 80 sf | normal | 7.8 | 100.0
192.168.4.16 25 passive exter | normal | 8.3 | 100.0
192.168.2.10 80 active | normal | 23.2 | 100.0
192.168.4.18 tcp 25 sf | normal | 8.5 | 100.0
192.168.7.23 80 active | normal | 19.4 | 100.0
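The two filtering steps and the subsequent matching in the detecting phase can be sketched as follows. This is a hedged illustration, not the authors' code: the `Rule` structure, the category names, the tests used to recognize IP addresses and ports, and the "suspicious" verdict for records that match no retained rule are all assumptions. The first sample rule reuses a row of Table I, and the other two are invented to show what gets filtered out.

```python
import re
from typing import NamedTuple, Tuple


class Rule(NamedTuple):
    left: Tuple[str, ...]
    right: Tuple[str, ...]
    support: float      # percent
    confidence: float   # percent


CATEGORIES = {"normal", "attack"}                   # assumed category tags
IP_PATTERN = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def is_useful(rule):
    """Keep a rule only if its right side names a category and its left
    side contains both an IP address and a port number."""
    has_category = any(item in CATEGORIES for item in rule.right)
    has_ip = any(IP_PATTERN.match(item) for item in rule.left)
    has_port = any(item.isdigit() for item in rule.left)
    return has_category and has_ip and has_port


def detect(record, rules):
    """Return the category of the first rule whose left items all occur in
    the record, or 'suspicious' if no rule covers the record."""
    items = set(record)
    for rule in rules:
        if set(rule.left) <= items:
            return rule.right[0]
    return "suspicious"


mined = [
    Rule(("192.168.7.13", "80", "sf"), ("normal",), 7.8, 100.0),  # from Table I
    Rule(("tcp", "sf"), ("normal",), 40.0, 100.0),                # no IP or port
    Rule(("192.168.7.13", "80"), ("sf",), 9.0, 100.0),            # no category
]
detection_rules = [r for r in mined if is_useful(r)]

print(len(detection_rules))
print(detect(("192.168.7.13", "tcp", "80", "sf"), detection_rules))
print(detect(("10.0.0.5", "tcp", "4444", "sf"), detection_rules))
```

Because the rules are learned from traffic recorded in the normal state, the sketch treats a record that no retained rule covers as a candidate intrusion; a real deployment would need a more careful policy for such records.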
IV. CONCLUSIONS
The hybrid IDS can efficiently detect both known and unknown intrusions. Research on intrusion detection based on data mining is a hot topic at home and abroad. A series of theoretical and practical problems remain to be resolved, and a number of key technologies require further in-depth study. The experiment shows that the design and implementation of an efficient and accurate IDS based on data mining is a large, complex project.
In applying data mining algorithms to the original connection records, the key problem is how to obtain the corresponding frequent patterns effectively. In the future, we will focus on how to select appropriate and representative original data and how to filter out useless rules precisely.
ACKNOWLEDGMENT
This work was supported by the Natural Science Foundation of Zhejiang Province, China (No. Y1080343).
REFERENCES
[1] W. Lee and S. J. Stolfo. Data mining approaches for intrusion detection. In Proceedings of the 7th USENIX Security Symposium, San Antonio, TX, January 1998.
[2] W. Lee and S. J. Stolfo. A data mining framework for building intrusion detection models. In Proceedings of the 1999 IEEE Symposium on Security and Privacy, Oakland, CA, May 1999.
[3] http://www.sans.org/resources/idfaq/data_mining.php?printer=Y, 2003.4
[4] Chinese Academy of Sciences (CAS). Network IDS technology in CAS reached the international advanced level, in Chinese. Retrieved from http://www.cas.cn/jzd/jcx/jcxlc/200204/t20020403_1034832.shtml
[5] Xu Jing, Liu Baoxu and Xu Rongsheng. Design and implementation of data mining-based IDS, in Chinese. Computer Engineering, Beijing, 2002, 28(6), pp. 9-10, 169.