Intrusion Detection and Malware Analysis
Anomaly-based IDS

Pavel Laskov
Wilhelm Schickard Institute for Computer Science


Taxonomy of anomaly-based IDS

Features:
- Packet headers
- Byte streams
- Syntactic events

Methods:
- Manual modeling: no training, model constructed by hand
- Learning from clean data: training requires attack-free data
- Learning from noisy data: training is required, but contamination with attacks can be tolerated
- Online learning: no training, detection "on the fly" after a short initialization period


Anomaly-based IDS to be discussed

- PHAD: packet header features, learning from clean data
- ALAD: TCP connection features and application keywords, learning from clean data
- PAYL: content byte stream features, learning from clean data
- (Re)MIND: content byte stream and structured features, learning from noisy data, online learning


PHAD features

Protocol   Header fields
Ethernet   size, dest hi, dest lo, src hi, src lo, proto
IP         hlen, tos, len, frag id, frag ptr, ttl, proto, checksum, src addr, dest addr
TCP        src port, dst port, seq num, ack num, hlen, flags, wsize, checksum, urg ptr, options
UDP        src port, dst port, len, checksum
ICMP       type, code, checksum


PHAD algorithm

Probability of novel feature values:
Suppose an attribute x has previously assumed r distinct values in n observations (r < n). Then the probability that the next value of x is different from all previously seen values is approximately r/n.

Computation of the anomaly score:
- For each attribute x in the training data, compute the ratio r/n.
- For every new packet p, let Q be the set of attributes whose values were not seen in the training data. Let t_i be the time since the last anomalous value was seen in attribute i. Then the anomaly score of the packet p is:

$$ s(p) = \sum_{i \in Q} \frac{t_i \, n_i}{r_i} $$


PHAD example

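Training data (each packet is a pair of attribute values):
{[A, 1], [B, 2], [B, 3], [C, 4], [A, 1], [A, 2], [C, 3], [C, 4]}

r/n values: {3/8, 1/2}

Test data: {[A, 1], [D, 2], [B, 3], [E, 5]}

Score for the last packet, with time measured in packets: the first attribute was last anomalous at packet 2 (t_1 = 2), and the second attribute has not been anomalous since the start of the test data (t_2 = 4):

s(p_4) = 2 · 8/3 + 4 · 2 = 13.33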
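The example can be replayed with a minimal Python sketch of the scoring rule. It assumes that time is measured as the packet index within the test stream and that the last anomaly of every attribute is initialised to time 0; the names `phad_train` and `phad_scores` are illustrative and not taken from the PHAD implementation.

```python
def phad_train(packets):
    """Collect, per attribute, the set of values seen in training; n is the packet count."""
    seen = [set() for _ in packets[0]]
    for packet in packets:
        for i, value in enumerate(packet):
            seen[i].add(value)
    return seen, len(packets)


def phad_scores(seen, n, packets):
    """Score each test packet as the sum of t_i * n / r_i over attributes with unseen values."""
    last_anomaly = [0] * len(seen)      # assumption: no anomaly before the test stream starts
    scores = []
    for t, packet in enumerate(packets, start=1):
        score = 0.0
        for i, value in enumerate(packet):
            if value not in seen[i]:    # attribute i belongs to Q
                r = len(seen[i])
                score += (t - last_anomaly[i]) * n / r
                last_anomaly[i] = t
        scores.append(score)
    return scores


training = [("A", 1), ("B", 2), ("B", 3), ("C", 4),
            ("A", 1), ("A", 2), ("C", 3), ("C", 4)]
test = [("A", 1), ("D", 2), ("B", 3), ("E", 5)]

seen, n = phad_train(training)
scores = phad_scores(seen, n, test)
print([round(s, 2) for s in scores])    # [0.0, 5.33, 0.0, 13.33]
```

Unlike this sketch, a deployed detector would measure t_i in wall-clock time rather than in packet counts.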


Feature clustering in PHAD

Problem:
- Each PHAD feature is 4 bytes long: 2^32 values are possible
- Storing r/n ratios for every possible value is infeasible

Feature clustering:
- Put a cap C on the number of values to be considered for each feature (e.g. C = 32).
- Every time this cap is exceeded, merge the two nearest values (or ranges) into a new range, e.g. for C = 3:

  {3-5, 8, 10-15, 20} → {3-5, 8-15, 20}

- A value of an attribute of a packet p is considered unseen if it falls outside of the known ranges.
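The merging step can be sketched in Python as follows. The representation of ranges as sorted `(lo, hi)` tuples and the function names `add_value` and `is_unseen` are assumptions made for illustration only.

```python
def add_value(ranges, value, cap):
    """Add a value to a list of (lo, hi) ranges, merging nearest neighbours while over the cap."""
    if any(lo <= value <= hi for lo, hi in ranges):
        return sorted(ranges)                     # already covered, nothing to do
    ranges = sorted(ranges + [(value, value)])
    while len(ranges) > cap:
        # gap between each range and its right-hand neighbour
        gaps = [ranges[i + 1][0] - ranges[i][1] for i in range(len(ranges) - 1)]
        i = gaps.index(min(gaps))                 # the two nearest ranges
        ranges[i:i + 2] = [(ranges[i][0], ranges[i + 1][1])]
    return ranges


def is_unseen(ranges, value):
    """A value counts as unseen if it falls outside every known range."""
    return not any(lo <= value <= hi for lo, hi in ranges)


# The slide's example with C = 3: adding 8 to {3-5, 10-15, 20} forces one merge.
clustered = add_value([(3, 5), (10, 15), (20, 20)], 8, cap=3)
print(clustered)    # [(3, 5), (8, 15), (20, 20)]
```

Note that after the merge the value 9 is treated as seen although it never occurred, which is the price paid for bounded memory.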


PHAD summary

Simple algorithm for anomaly scoring:
- (+) High performance (ca. 75,000 packets/s)
- (+) Easy to implement (ca. 400 lines of C++ code)
- (−) Clean training data required
- (−−) Poor detection rate: 28% on the DARPA 1999 dataset

Packet header features:
- (+) Easy to extract
- (−) Not useful for detecting real exploits


ALAD features

- P(src ip | dst ip): a set of client hosts for each host.
- P(src ip | dst ip, dst port): a set of client hosts for each service and each host.
- P(dst ip, dst port): a set of normal servers on a site (unusual values may be indicative of probes).
- P(tcp flags | dst port): usual patterns of TCP connection initiation and closure for each service.
- P(keywords | dst port): typical application-level keywords for each service.


ALAD algorithm

- Detection is carried out at the connection level.
- For conditional features, the r/n models are computed for each pre-conditioner (i.e. dst ip, dst port or a {dst ip, dst port} pair).
- For the joint feature, the count r is computed over all pairs of values and divided by the total number of connections.
- No feature clustering is performed.
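A minimal Python sketch of the per-pre-conditioner r/n models is given below. The connection records are plain dictionaries with hypothetical field names (`src_ip`, `dst_ip`, `dst_port`), and `conditional_models` is an illustrative helper rather than part of ALAD itself.

```python
from collections import defaultdict


def conditional_models(connections, feature, condition):
    """Return {pre-conditioner value: (r, n)} for the model P(feature | condition)."""
    values = defaultdict(set)   # distinct feature values seen per pre-conditioner
    counts = defaultdict(int)   # number of connections seen per pre-conditioner
    for conn in connections:
        key = tuple(conn[field] for field in condition)
        values[key].add(conn[feature])
        counts[key] += 1
    return {key: (len(values[key]), counts[key]) for key in counts}


# Made-up connection records, for illustration only.
connections = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "dst_port": 80},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.9", "dst_port": 80},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "dst_port": 22},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "dst_port": 80},
]

per_host = conditional_models(connections, "src_ip", ("dst_ip",))
per_service = conditional_models(connections, "src_ip", ("dst_ip", "dst_port"))
print(per_host)
print(per_service)
```

Each (r, n) pair plays the same role as the per-attribute r and n in PHAD, but is kept separately for every host, port, or host and port pair.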


ALAD summary

- Similar algorithm and same problems as PHAD.
- Limited application-level features slightly improve detection of "difficult" attacks in the DARPA 1999 dataset.


PAYL: payload byte sequence analysis

Motivation: detect anomalous packet payloads.

- Problem 1: how can one define a "normal packet payload"?
- Problem 2: how can one measure the degree of anomaly of a packet payload?


PAYL features: raw byte histograms

A byte histogram is an array of size 256 measuring the frequency of every possible byte value in a given packet payload.
Examples:

Figure 1 shows how payload byte distributions vary from port to port and between flow directions. Each plot represents the characteristic profile for that port and flow direction (inbound/outbound). The distributions for port 22 (inbound and outbound) show no discernible pattern: for such encrypted channels the byte frequencies are close to uniform across all 256 byte values, each with low variance. Hence, encrypted channels are fairly easy to spot. Note that this figure was generated from a dataset containing only the first 96 bytes of payload in each packet, and a very clear pattern is already visible with the truncated payload. Figure 2 displays the variability of the frequency distributions among payloads of different lengths. The two plots show the distributions of incoming traffic to the same web server on port 80 for two payload lengths, 200 bytes and 1,460 bytes. Clearly, a single monolithic model for both length categories would not represent the distributions accurately.

Fig. 1. Example byte distributions for different ports. For each plot, the X-axis is the ASCII byte 0-255, and the Y-axis is the average byte frequency.

Fig. 2. Example byte distribution for different payload lengths for port 80 on the same host server.

[Figure panels: port 22 (SSH), port 25 (SMTP), port 80 (HTTP)]

Advantages of byte histograms:
- Mean and standard deviation can be computed
- A distance between two histograms can be computed
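A byte histogram as defined above can be computed in a few lines of Python. This is a minimal sketch; the function name `byte_histogram` and the choice of relative (rather than absolute) frequencies are assumptions.

```python
def byte_histogram(payload: bytes) -> list:
    """Relative frequency of each of the 256 byte values in a payload."""
    counts = [0] * 256
    for byte in payload:            # iterating over bytes yields integers 0..255
        counts[byte] += 1
    total = len(payload) or 1       # avoid division by zero for empty payloads
    return [count / total for count in counts]


histogram = byte_histogram(b"GET /index.html HTTP/1.1\r\n")
print(histogram[ord("T")])          # share of the byte 'T' in the payload
```

Because every payload maps to a vector of the same fixed length, means, standard deviations and distances can be computed component by component.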


PAYL algorithm

Training:
- For each observed packet length, compute average byte histograms with standard deviations.
- Cluster average models for neighboring packet lengths by merging two histograms if their distance is less than some pre-defined threshold t.

Anomaly detection:
- For each new packet p, compute its distance from the normal profile p̄ for the corresponding packet length (cluster):

$$ s(p) = \sum_{i=1}^{256} \frac{|f_i(p) - f_i(\bar{p})|}{\sigma_i(\bar{p})} $$
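The training and scoring steps can be sketched in Python as follows. The sketch keeps one model per exact payload length and omits the merging of neighboring lengths; the smoothing constant `alpha`, which is not on the slide, is an assumption added to avoid division by zero for byte values whose frequency never varied in training. All function names are illustrative.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def byte_histogram(payload):
    """Relative frequency of each of the 256 byte values in a payload."""
    counts = [0] * 256
    for byte in payload:
        counts[byte] += 1
    total = len(payload) or 1
    return [count / total for count in counts]


def payl_train(payloads):
    """Per payload length: mean and standard deviation of every byte frequency."""
    by_length = defaultdict(list)
    for payload in payloads:
        by_length[len(payload)].append(byte_histogram(payload))
    model = {}
    for length, histograms in by_length.items():
        columns = list(zip(*histograms))        # one column per byte value
        model[length] = ([fmean(col) for col in columns],
                         [pstdev(col) for col in columns])
    return model


def payl_score(model, payload, alpha=0.001):
    """Sum of |f_i(p) - mean_i| / (sigma_i + alpha); raises KeyError for an unseen length."""
    mean, sigma = model[len(payload)]
    freq = byte_histogram(payload)
    return sum(abs(f - m) / (s + alpha) for f, m, s in zip(freq, mean, sigma))


model = payl_train([b"aaaa", b"aaab", b"aaba"])
print(payl_score(model, b"aaaa"), payl_score(model, b"zzzz"))
```

A payload made of byte values never seen in training for that length receives a very large score, because the corresponding standard deviations are zero.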


PAYL summary

- Anomaly detection over packet payload byte sequences.
- Reasonable accuracy: over 50% detection rate for remote attacks at 1% FP rate.
- Simple detection algorithm: computationally efficient.

Drawbacks:
- Packet mode operation: evasion problems!
- Primitive data structures: only single-byte histograms possible.
- Clean training data required.


MIND: machine learning IDS

Motivation:
- Learning from contaminated data
- Improvement of accuracy: good detection at extremely small FP rates needed!
- Incorporation of semantic structure using advanced learning models.

Problem 1: How can one compare packets (connections) in a more structural way than simple histograms?
Problem 2: How can one learn in the presence of attacks in training data?


Learning from clean data

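Training: find the smallest enclosing sphere

$$ \min_{R, c} \; R^2 \quad \text{s.t.} \quad \|x_i - c\|^2 \le R^2, \quad i = 1, \ldots, M. $$

Detection: compute the distance from the center

$$ \|x - c\|^2 > R^2 \;\Rightarrow\; \text{alarm} $$
$$ \|x - c\|^2 \le R^2 \;\Rightarrow\; \text{normal} $$

[Figure: sphere with center c and radius R enclosing the training points]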
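The geometric idea can be tried out with a short Python sketch. Instead of solving the optimisation problem exactly, it swaps in the simple iterative approximation of Bădoiu and Clarkson, which repeatedly moves the center a shrinking step towards the farthest point; the function names and the iteration count are assumptions.

```python
import math


def enclosing_sphere(points, iterations=1000):
    """Approximate smallest enclosing sphere (Badoiu-Clarkson iteration)."""
    center = list(points[0])
    for k in range(1, iterations + 1):
        farthest = max(points, key=lambda p: math.dist(p, center))
        # move the center towards the farthest point with step size 1 / (k + 1)
        center = [c + (f - c) / (k + 1) for c, f in zip(center, farthest)]
    radius = max(math.dist(p, center) for p in points)
    return center, radius


def is_anomalous(x, center, radius):
    """Raise an alarm if x lies outside the sphere."""
    return math.dist(x, center) > radius


training = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
center, radius = enclosing_sphere(training)
print(center, radius)                           # close to (1, 1) and sqrt(2)
print(is_anomalous((10.0, 10.0), center, radius))
```

By construction every training point lies inside the returned sphere, which is exactly why this variant requires clean training data: a single attack in the training set inflates the radius.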


Learning from contaminated data

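Training: soften the constraints using slack variables.

$$ \min_{R, c, \xi} \; R^2 + \eta \sum_{i=1}^{M} \xi_i \quad \text{s.t.} \quad \|x_i - c\|^2 \le R^2 + \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, M $$

The constant η controls the acceptable noise rate in the training data.

Detection: compute the distance from the center.

[Figure: sphere with center c and radius R]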
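The role of η can be illustrated with a deliberately simplified Python sketch. It assumes the center is fixed at the centroid of the training data, which is not the true optimum of the problem above. For a fixed center the objective is piecewise linear in R², with slope 1 − η · (number of points outside), so the best R² is the squared distance below which at most ⌊1/η⌋ points remain outside. The names `soft_sphere` and `sq_dist` are illustrative.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def soft_sphere(points, eta):
    """Soft enclosing sphere with the center fixed at the centroid (simplifying assumption)."""
    dim = len(points[0])
    center = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    distances = sorted((sq_dist(p, center) for p in points), reverse=True)
    k = math.floor(1 / eta)                 # at most k points may stay outside
    r2 = distances[k] if k < len(distances) else 0.0
    return center, r2


points = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (100, 100)]

center, r2 = soft_sphere(points, eta=1.0)   # tolerate one outlier
outside = [p for p in points if sq_dist(p, center) > r2]
print(outside)

center_hard, r2_hard = soft_sphere(points, eta=2.0)   # no outliers tolerated
print(r2_hard > r2)
```

A larger η makes slack more expensive and pushes the sphere towards the hard-margin case, while a smaller η lets more training points fall outside.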


Beyond the mathematical abstraction

- How do we apply this geometric intuition to network security?
- Interesting observation: the only operation on data involved in the learning approach is the computation of similarity between two points: ||x − c||.
- If we can compute similarity between a pair of observed network events, any learning algorithm can be plugged in!
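To make the observation concrete, the following Python sketch computes the squared distance ||x − c||² using nothing but a similarity function `k` between pairs of objects. It assumes that the center is a weighted combination of training points, c = Σ α_i x_i, so that ||x − c||² = k(x, x) − 2 Σ α_i k(x, x_i) + Σ Σ α_i α_j k(x_i, x_j). The function names and the weights are illustrative.

```python
def sq_dist_to_center(x, training, alphas, k):
    """Squared distance from x to c = sum(alpha_i * x_i), using only the similarity k."""
    cross = sum(a * k(x, xi) for a, xi in zip(alphas, training))
    center_norm = sum(ai * aj * k(xi, xj)
                      for ai, xi in zip(alphas, training)
                      for aj, xj in zip(alphas, training))
    return k(x, x) - 2 * cross + center_norm


def dot(a, b):
    """Plain dot product; any other valid similarity measure could be plugged in here."""
    return sum(ai * bi for ai, bi in zip(a, b))


training = [(0.0, 0.0), (2.0, 0.0)]
alphas = [0.5, 0.5]                       # center is the midpoint (1, 0)
print(sq_dist_to_center((1.0, 2.0), training, alphas, dot))   # 4.0
```

Replacing `dot` by a similarity defined on strings or connections leaves `sq_dist_to_center` unchanged, which is the point of the slide.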


Embedding of sequences in metric spaces

Sequences:
1. blabla blubla blablabu aa
2. bla blablaa bulab bb abla
3. a blabla blabla ablub bla
4. blab blab abba blabla blu

[Figure: each sequence is split into subsequences (e.g. a, b, aa, bb, bla, blu); histograms of subsequences serve as features, which place sequences 1-4 as points in a geometric space]


Embedding example

X = abrakadabra
Y = barakobama

1-grams     X      Y      X · Y
a           5      4      20
b           2      2       4
d           1      -       -
k           1      1       1
m           -      1       -
o           -      1       -
r           2      1       2
norm, sum   5.92   4.90   27

∠ XY = 21.3°

2-grams     X      Y      X · Y
ab          2      -       -
ad          1      -       -
ak          1      1       1
am          -      1       -
ar          -      1       -
ba          -      2       -
br          2      -       -
da          1      -       -
ka          1      -       -
ko          -      1       -
ma          -      1       -
ob          -      1       -
ra          2      1       2
norm, sum   4.00   3.32    3

∠ XY = 76.9°
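The embedding can be recomputed with a short Python sketch that counts n-grams, takes dot products and norms of the resulting count vectors, and derives the angle between them from unrounded values. The helper names `ngram_counts` and `angle_degrees` are illustrative.

```python
import math
from collections import Counter


def ngram_counts(sequence, n):
    """Histogram of all contiguous subsequences of length n."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def norm(counts):
    """Euclidean norm of a count vector."""
    return math.sqrt(sum(v * v for v in counts.values()))


def angle_degrees(x, y):
    """Angle between two n-gram count vectors."""
    dot = sum(x[g] * y[g] for g in x.keys() & y.keys())   # only shared n-grams contribute
    return math.degrees(math.acos(dot / (norm(x) * norm(y))))


X, Y = "abrakadabra", "barakobama"
for n in (1, 2):
    x, y = ngram_counts(X, n), ngram_counts(Y, n)
    print(n, round(norm(x), 2), round(norm(y), 2), round(angle_degrees(x, y), 1))
```

The two strings look similar at the level of single characters but far less so at the level of 2-grams, which shows how the choice of subsequence length shapes the geometry.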


Experimental evaluation of MIND

Data:
- 3 weeks of HTTP traffic at FIRST (622,734 HTTP requests)
- 120 attacks generated by Metasploit

Quality measure: Receiver Operating Characteristic (ROC)

[Figure: two ROC plots of true-positive rate (0 to 1) against false-positive rate (0.0001 to 0.01). First plot: ReMIND and Snort IDS. Second plot: ReMIND, SSAD, Anagram and Tokengram.]


MIND summary

- Anomaly detection over connection payloads
- Excellent accuracy: over 90% detection with no false alarms
- Learning in the presence of attacks.
- Sophisticated data structures.

Drawbacks:
- Computationally more involved: efficient data structures and fine-tuning are required.
- Delay in detection until completion of TCP connections: incremental processing is needed.


Lessons learned

- Modern anomaly detection methods can achieve higher detection accuracy than signature-based IDS at low false alarm rates.
- Anomaly detection methods must be equipped for learning from contaminated data (or be able to clean the data automatically).
- Efficient data structures for feature extraction are the key to the success of anomaly detection.


Recommended reading

M. Mahoney and P. Chan. Learning nonstationary models of normal network traffic for detecting novel attacks. In Proc. of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 376–385, 2002.

M. Mahoney and P. Chan. Learning rules for anomaly detection of hostile network traffic. In Proc. of International Conference on Data Mining (ICDM), 2003.

K. Rieck and P. Laskov. Language models for detection of unknown attacks in network traffic. Journal in Computer Virology, 2(4):243–256, 2007.

K. Wang and S. Stolfo. Anomalous payload-based network intrusion detection. In Recent Advances in Intrusion Detection (RAID), pages 203–222, 2004.