
Intrusion Detection and Malware Analysis
Anomaly-based IDS

Pavel Laskov
Wilhelm Schickard Institute for Computer Science

Taxonomy of anomaly-based IDS

Features:
- Packet headers
- Byte streams
- Syntactic events

Methods:
- Manual modeling: no training, model constructed by hand
- Learning from clean data: training requires attack-free data
- Learning from noisy data: training is required, but contamination with attacks can be tolerated
- Online learning: no training, detection "on the fly" after a short initialization period

Anomaly-based IDS to be discussed

- PHAD: packet header features, learning from clean data
- ALAD: TCP connection features and application keywords, learning from clean data
- PAYL: content byte stream features, learning from clean data
- (Re)MIND: content byte stream and structured features, learning from noisy data, online learning

PHAD features

Ethernet: size, dest hi, dest lo, src hi, src lo, proto
IP: hlen, tos, len, frag id, frag ptr, ttl, proto, checksum, src addr, dest addr
TCP: src port, dst port, seq num, ack num, hlen, flags, wsize, checksum, urg ptr, options
UDP: src port, dst port, len, checksum
ICMP: type, code, checksum

PHAD algorithm

Probability of novel feature values
Suppose an attribute x has previously assumed r distinct values in n observations (r < n). Then the probability that the next value of x differs from all previously seen values is approximately $r/n$.

Computation of the anomaly score
- For each attribute x in the training data, compute the ratio $r/n$.
- For every new packet p, let Q be the set of attributes whose values were not seen in the training data. Let $t_i$ be the time since an anomalous value was last seen in attribute i. Then the anomaly score of the packet p is:

$$s(p) = \sum_{i \in Q} \frac{t_i n_i}{r_i}$$

PHAD example

Training data: {[A, 1], [B, 2], [B, 3], [C, 4], [A, 1], [A, 2], [C, 3], [C, 4]}
r/n values: {3/8, 1/2}
Test data: {[A, 1], [D, 2], [B, 3], [E, 5]}
Score for the last packet:

$$s(p_4) = 2 \cdot \frac{8}{3} + 4 \cdot 2 = 13.33$$
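The example can be replayed with a short Python sketch. It is not the original PHAD implementation: the function names are hypothetical, and it assumes that the clock starts at 0 when testing begins and advances by one unit per packet, which is the reading that reproduces the score of 13.33 for the last packet.

```python
from fractions import Fraction


def train(packets):
    """Collect, per attribute position, the set of values seen in training."""
    n = len(packets)
    seen = [set(column) for column in zip(*packets)]
    return seen, n


def score_stream(seen, n, packets):
    """Score each packet as the sum of t_i * n_i / r_i over its novel attributes."""
    # Assumption: time 0 is the start of testing; one time unit per packet.
    last_anomaly = [0] * len(seen)
    scores = []
    for now, packet in enumerate(packets, start=1):
        score = Fraction(0)
        for i, value in enumerate(packet):
            if value not in seen[i]:
                t = now - last_anomaly[i]      # time since the last anomaly in attribute i
                r = len(seen[i])               # number of distinct training values
                score += Fraction(t * n, r)
                last_anomaly[i] = now
        scores.append(score)
    return scores


training = [("A", 1), ("B", 2), ("B", 3), ("C", 4),
            ("A", 1), ("A", 2), ("C", 3), ("C", 4)]
test = [("A", 1), ("D", 2), ("B", 3), ("E", 5)]

seen, n = train(training)
scores = score_stream(seen, n, test)
print([round(float(s), 2) for s in scores])
```

For the last packet both attributes are novel: the first contributes 2 · 8/3 (the previous anomaly, D, occurred two packets earlier) and the second contributes 4 · 8/4.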

Feature clustering in PHAD

Problem:
- Each PHAD feature is 4 bytes long: $2^{32}$ values are possible.
- Storing $r/n$ ratios for every possible value is infeasible.

Feature clustering:
- Put a cap C on the number of values to be considered for each feature (e.g. C = 32).
- Every time this cap is exceeded, merge the two nearest values (or ranges) into a new range, e.g. (for C = 3):

{3-5, 8, 10-15, 20} → {3-5, 8-15, 20}

- A value of an attribute of a packet p is considered unseen if it falls outside of the known ranges.
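The merging step can be sketched as follows. This is a minimal illustration, assuming that "nearest" means the smallest gap between neighbouring ranges; the function names are hypothetical and not taken from the PHAD source code.

```python
def insert_value(ranges, value, cap):
    """Add a value to a list of (lo, hi) ranges, merging nearest neighbours above the cap."""
    if any(lo <= value <= hi for lo, hi in ranges):
        return list(ranges)                       # already covered, nothing to do
    ranges = sorted(list(ranges) + [(value, value)])
    while len(ranges) > cap:
        # Gap between each range and its right-hand neighbour.
        gaps = [ranges[i + 1][0] - ranges[i][1] for i in range(len(ranges) - 1)]
        i = gaps.index(min(gaps))
        ranges[i:i + 2] = [(ranges[i][0], ranges[i + 1][1])]
    return ranges


def is_unseen(ranges, value):
    """A value is unseen if it falls outside all known ranges."""
    return not any(lo <= value <= hi for lo, hi in ranges)


# Example from above with C = 3: adding 8 exceeds the cap, so 8 and 10-15 are merged.
merged = insert_value([(3, 5), (10, 15), (20, 20)], 8, cap=3)
print(merged)
```

Note the side effect of merging: after the merge, the value 9 counts as seen although it never occurred in training.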

PHAD summary

Simple algorithm for anomaly scoring:
(+) High performance (ca. 75,000 packets/s)
(+) Easy to implement (ca. 400 lines of C++ code)
(−) Clean training data required
(−−) Poor detection rates: 28% on the DARPA 1999 dataset

Packet header features:
(+) Easy to extract
(−) Not useful for detecting real exploits

ALAD features

- P(src ip | dst ip): a set of client hosts for each host.
- P(src ip | dst ip, dst port): a set of client hosts for each service and each host.
- P(dst ip, dst port): a set of normal servers on a site (unusual values may be indicative of probes).
- P(tcp flags | dst port): usual patterns of TCP connection initiation and closure for each service.
- P(keywords | dst port): typical application-level keywords for each service.

ALAD algorithm

- Detection is carried out at the connection level.
- For conditional features, the $r/n$ models are computed for each pre-conditioner (i.e. dst ip, dst port or a {dst ip, dst port} pair).
- For the joint feature, r is the number of distinct pairs of values and n is the total number of connections.
- No feature clustering is performed.
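A conditional $r/n$ model can be sketched as one counter per pre-conditioner value. The sketch below is only an illustration: the connection records, field names and addresses are made up, and the function name is hypothetical.

```python
from collections import defaultdict


def train_conditional(connections, target, given):
    """Build an r/n model of P(target | given), one entry per pre-conditioner value."""
    values = defaultdict(set)   # pre-conditioner -> distinct target values (r = size)
    counts = defaultdict(int)   # pre-conditioner -> number of connections (n)
    for conn in connections:
        key = tuple(conn[field] for field in given)
        values[key].add(conn[target])
        counts[key] += 1
    return {key: {"r": len(values[key]), "n": counts[key], "seen": values[key]}
            for key in counts}


# Hypothetical training connections (illustrative addresses only).
connections = [
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.0.1", "dst_port": 80},
    {"src_ip": "10.0.0.6", "dst_ip": "10.0.0.1", "dst_port": 80},
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.0.1", "dst_port": 25},
    {"src_ip": "10.0.0.7", "dst_ip": "10.0.0.2", "dst_port": 22},
]

# P(src ip | dst ip): the set of clients observed for each host.
model = train_conditional(connections, target="src_ip", given=("dst_ip",))
entry = model[("10.0.0.1",)]
print(entry["r"], entry["n"])

# A client never seen for this host is a novel value for this feature.
novel = "10.0.0.9" not in entry["seen"]
```

The same function covers the other conditional features by changing `target` and `given`, e.g. `given=("dst_ip", "dst_port")` for the per-service client sets.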

ALAD summary

- Similar algorithm and same problems as PHAD.
- Limited application-level features slightly improve detection of "difficult" attacks in the DARPA 1999 dataset.

PAYL: payload byte sequence analysis

Motivation: detect anomalous packet payloads.

- Problem 1: how can one define a "normal packet payload"?
- Problem 2: how can one measure the degree of anomaly of a packet payload?



PAYL features: raw byte histograms

A byte histogram is an array of size 256 measuring the frequency of all possible byte values in a given packet payload.

Examples:

Figure 1 provides an example showing how the payload byte distributions vary from port to port, and from source and destination flows. Each plot represents the characteristic profile for that port and flow direction (inbound/outbound). Notice also that the distributions for ports 22 (inbound and outbound) show no discernible pattern, and hence the statistical distribution for such encrypted channels would entail a more uniform frequency distribution across all of the 256 byte values, each with low variance. Hence, encrypted channels are fairly easy to spot. Notice that this figure is actually generated from a dataset with only the first 96 bytes of payload in each packet, and there is already a very clear pattern with the truncated payload. Figure 2 displays the variability of the frequency distributions among different length payloads. The two plots characterize two different distributions from the incoming traffic to the same web server, port 80 for two different lengths, here payloads of 200 bytes, the other 1,460 bytes. Clearly, a single monolithic model for both length categories will not represent the distributions accurately.

Fig. 1. Example byte distributions for different ports. For each plot, the X-axis is the ASCII byte 0-255, and the Y-axis is the average byte frequency

Fig. 2. Example byte distribution for different payload lengths for port 80 on the same host server







[Figure: example byte histograms for port 22 (SSH), port 25 (SMTP) and port 80 (HTTP)]

Advantages of byte histograms:
- Computation of mean and standard deviation
- Computation of a distance between two histograms

PAYL algorithm

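Training
- For each observed packet length, compute average byte histograms with standard deviations.
- Cluster average models for neighboring packet lengths by merging two histograms if their distance is less than some pre-defined threshold t.

Anomaly detection
- For each new packet p, compute its distance from the normal profile for the corresponding packet length (cluster):

$$s(p) = \sum_{i=1}^{256} \frac{|f_i(p) - f_i(\bar{p})|}{\sigma_i(\bar{p})}$$

Here $f_i(p)$ is the frequency of byte value i in packet p, and $f_i(\bar{p})$ and $\sigma_i(\bar{p})$ are the mean and standard deviation of that frequency in the normal profile $\bar{p}$.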
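The training and scoring steps can be sketched in a few lines of Python. The sketch makes several simplifying assumptions that are not part of the description above: it keeps one profile per exact payload length (no merging of neighbouring lengths), it expects the test payload length to have occurred in training, and it adds a small constant `alpha` to each standard deviation so that byte values with zero variance do not cause a division by zero.

```python
from collections import defaultdict
from statistics import mean, pstdev


def byte_histogram(payload):
    """Relative frequency of each of the 256 byte values in a payload."""
    counts = [0] * 256
    for b in payload:
        counts[b] += 1
    total = len(payload) or 1
    return [c / total for c in counts]


def train(payloads):
    """Per payload length: mean and standard deviation of every byte frequency."""
    by_length = defaultdict(list)
    for p in payloads:
        by_length[len(p)].append(byte_histogram(p))
    model = {}
    for length, hists in by_length.items():
        columns = list(zip(*hists))            # one column per byte value
        model[length] = ([mean(c) for c in columns],
                         [pstdev(c) for c in columns])
    return model


def score(model, payload, alpha=0.001):
    """Distance of a payload from the profile for its length (alpha is an assumption)."""
    means, stds = model[len(payload)]
    f = byte_histogram(payload)
    return sum(abs(f[i] - means[i]) / (stds[i] + alpha) for i in range(256))


model = train([b"GET /a", b"GET /b", b"GET /c"])
normal_score = score(model, b"GET /a")
attack_score = score(model, b"\x90" * 6)       # six identical non-text bytes
print(normal_score < attack_score)
```

Byte values that never vary in training get a very small denominator, so any deviation in them dominates the score.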

PAYL summary

- Anomaly detection over packet payload byte sequences.
- Reasonable accuracy: over 50% detection rate for remote attacks at 1% FP rate.
- Simple detection algorithm: computationally efficient.
- Drawbacks:
  - Packet mode operation: evasion problems!
  - Primitive data structures: only single-byte histograms possible.
  - Clean training data required.

MIND: machine learning IDS

Motivation:
- Learning from contaminated data
- Improvement of accuracy: good detection at extremely small FP rates needed!
- Incorporation of semantic structure using advanced learning models.

- Problem 1: How can one compare packets (connections) in a more structural way than simple histograms?
- Problem 2: How can one learn in the presence of attacks in training data?




Learning from clean data

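Training: find the smallest enclosing sphere

$$\min_{R,\,c} \; R^2 \quad \text{s.t.} \quad \|x_i - c\|^2 \le R^2, \quad i = 1, \ldots, M.$$

Detection: compute the distance from the center

$$\|x - c\|^2 > R^2 \;\Rightarrow\; \text{alarm}$$

$$\|x - c\|^2 \le R^2 \;\Rightarrow\; \text{normal}$$

[Figure: sphere with center c and radius R]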
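The optimization problem above is normally handed to a quadratic programming solver. As a lightweight stand-in, the sketch below approximates the smallest enclosing sphere with the Badoiu-Clarkson iteration (repeatedly move the center a shrinking step towards the farthest point). This is a substituted technique chosen for brevity, not the method used in the lecture; the function names and the toy points are hypothetical.

```python
import math


def enclosing_sphere(points, iterations=1000):
    """Approximate the smallest enclosing sphere with the Badoiu-Clarkson iteration."""
    c = list(points[0])
    for k in range(1, iterations + 1):
        far = max(points, key=lambda p: math.dist(p, c))   # farthest training point
        c = [ci + (pi - ci) / (k + 1) for ci, pi in zip(c, far)]
    radius = max(math.dist(p, c) for p in points)          # sphere encloses all points
    return c, radius


def is_alarm(x, c, radius):
    """Detection rule: alarm if x lies outside the sphere."""
    return math.dist(x, c) > radius


# Toy training data: the corners of a square; the exact optimum is c = (0, 0), R = sqrt(2).
points = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
c, radius = enclosing_sphere(points)
print(c, radius)
print(is_alarm((3, 3), c, radius), is_alarm((0.5, 0.5), c, radius))
```

The returned radius is never smaller than the optimal one, because it is taken as the largest distance from the final center to any training point.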

Learning from contaminated data

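Training: soften the constraints using slack variables.

$$\min_{R,\,c} \; R^2 + \eta \sum_{i=1}^{M} \xi_i \quad \text{s.t.} \quad \|x_i - c\|^2 \le R^2 + \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, M$$

The constant η controls the acceptable noise rate in the training data.

Detection: compute the distance from the center.

[Figure: sphere with center c and radius R]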
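The role of η can be illustrated for the simplified case in which the center c is held fixed. That is an assumption made for this sketch only, since the full problem optimizes c and R jointly. For a fixed center the optimal slack is $\xi_i = \max(0, \|x_i - c\|^2 - R^2)$, so the objective is a convex, piecewise linear function of $R^2$ whose minimum lies at 0 or at one of the squared distances.

```python
import math


def soft_radius_sq(points, center, eta):
    """Minimise R^2 + eta * sum(xi_i) over R^2 for a fixed (assumed) center."""
    d2 = [math.dist(p, center) ** 2 for p in points]

    def objective(r2):
        # Optimal slack for a given radius: xi_i = max(0, d_i^2 - R^2).
        return r2 + eta * sum(max(0.0, d - r2) for d in d2)

    # Convex and piecewise linear in R^2: the minimum is attained at a breakpoint.
    return min([0.0] + d2, key=objective)


# Four points at distance 1 from the origin and one outlier at distance 10.
points = [(1, 0), (0, 1), (-1, 0), (0, -1), (10, 0)]
tolerant = soft_radius_sq(points, center=(0, 0), eta=0.5)
strict = soft_radius_sq(points, center=(0, 0), eta=2.0)
print(tolerant, strict)
```

With the small penalty the sphere leaves the outlier outside and pays for it with slack; with the large penalty it grows to cover every point, which is the behaviour of the hard-constraint formulation.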




Beyond the mathematical abstraction

How do we apply this geometric intuition to network security?

- Interesting observation: the only operation on data involved in the learning approach is the computation of similarity between two points: $\|x - c\|$.
- If we can compute similarity between a pair of observed network events, any learning algorithm can be plugged in!



Embedding of sequences in metric spaces

[Figure: four example sequences (1. "blabla blubla blablabu aa", 2. "bla blablaa bulab bb abla", 3. "a blabla blabla ablub bla", 4. "blab blab abba blabla blu") are decomposed into subsequences such as "a", "b", "aa", "bb", "bla" and "blu". The histograms of subsequences serve as features, which place sequences 1 to 4 as points in a geometric space.]

Embedding example

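X = abrakadabra
Y = barakobama

Single characters (entries are value/count; the last row gives the norms of X and Y and the dot product):

  X       Y       X · Y
  a/5     a/4     20
  b/2     b/2     4
  d/1
  k/1     k/1     1
          m/1
          o/1
  r/2     r/1     2
  5.92    4.90    27

∠ XY = 21.5°

Pairs of characters:

  X       Y       X · Y
  ab/2
  ad/1
  ak/1    ak/1    1
          am/1
          ar/1
          ba/2
  br/2
  da/1
  ka/1
          ko/1
          ma/1
          ob/1
  ra/2    ra/1    2
  4.00    3.46    3

∠ XY = 77.5°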
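The embedding can be reproduced with a few lines of Python: count the n-grams of each string, treat the counts as sparse vectors, and compute the angle between them. The helper names are hypothetical. Computed directly from the counts, the dot products come out as 27 and 3, and the angles as roughly 21.3° and 76.9°, close to but not identical with the values quoted in the example.

```python
import math
from collections import Counter


def ngrams(s, n):
    """Histogram of all contiguous substrings of length n."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def dot(x, y):
    """Dot product of two sparse histograms."""
    return sum(count * y[gram] for gram, count in x.items() if gram in y)


def angle_deg(x, y):
    """Angle between two histograms, in degrees."""
    cos = dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))
    return math.degrees(math.acos(cos))


X, Y = "abrakadabra", "barakobama"
for n in (1, 2):
    hx, hy = ngrams(X, n), ngrams(Y, n)
    print(n, dot(hx, hy), round(angle_deg(hx, hy), 1))
```

Moving from single characters to pairs makes the two strings look far less similar, because longer subsequences capture more of the ordering.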

Experimental evaluation of MIND

Data:
- 3 weeks of HTTP traffic at FIRST (622,734 HTTP requests)
- 120 attacks generated by Metasploit

Quality measure: Receiver Operating Characteristic (ROC)

[Figure: two ROC plots of true-positive rate (0 to 1) against false-positive rate (0.0001 to 0.01). The first plot compares ReMIND with Snort IDS; the second compares ReMIND with SSAD, Anagram and Tokengram.]

MIND summary

- Anomaly detection over connection payloads
- Excellent accuracy: over 90% detection with no false alarms
- Learning in the presence of attacks.
- Sophisticated data structures.
- Drawbacks:
  - Computationally more involved: efficient data structures and fine-tuning are required.
  - Delay in detection until completion of TCP connections: incremental processing is needed.

Lessons learned

- Modern anomaly detection methods can achieve detection accuracy superior to signature-based IDS with low false alarm rates.
- Anomaly detection methods must be equipped for learning from contaminated data (or be able to clean the data automatically).
- Efficient data structures for feature extraction are the key to the success of anomaly detection.

Recommended reading

M. Mahoney and P. Chan. Learning nonstationary models of normal network traffic for detecting novel attacks. In Proc. of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 376–385, 2002.

M. Mahoney and P. Chan. Learning rules for anomaly detection of hostile network traffic. In Proc. of International Conference on Data Mining (ICDM), 2003.

K. Rieck and P. Laskov. Language models for detection of unknown attacks in network traffic. Journal in Computer Virology, 2(4):243–256, 2007.

K. Wang and S. Stolfo. Anomalous payload-based network intrusion detection. In Recent Advances in Intrusion Detection (RAID), pages 203–222, 2004.
