Intrusion Detection and Malware Analysis: Automatic signature generation

Pavel Laskov, Wilhelm Schickard Institute for Computer Science

The quest for attack signatures

Post-mortem: security research, computer forensics
Reactive: analysis of anomalies (forensic sinks)
Proactive: acquisition and analysis of malicious data

A general framework for ASG

Clustering: finding groups of similar malicious events
Token extraction: finding common patterns in malicious data
Signature assembly: assessment of extracted tokens

Signature format

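A set of tokens $t_1, \ldots, t_n$

A set of support values $\nu_1, \ldots, \nu_n$

A threshold $\theta$

Evaluation rule:

$$\sum_{i=1}^{n} \nu_i \, M(t_i, s) > \theta,$$

where

$$M(t_i, s) = \begin{cases} 1 & \text{if } t_i \text{ is present in a string } s \\ 0 & \text{otherwise} \end{cases}$$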
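The evaluation rule can be sketched in a few lines of Python. This is only an illustration: the function names, the tokens, the support values and the threshold below are hypothetical and are not taken from the slides.

```python
def signature_score(tokens, supports, s):
    """Sum the support values of all tokens that are present in s."""
    return sum(nu for token, nu in zip(tokens, supports) if token in s)


def signature_matches(tokens, supports, theta, s):
    """Apply the evaluation rule: the weighted token count must exceed theta."""
    return signature_score(tokens, supports, s) > theta


# Hypothetical signature: three tokens with their support values.
tokens = ["GET /gate.php", "id=", "cmd="]
supports = [0.5, 0.3, 0.2]
theta = 0.6

print(signature_matches(tokens, supports, theta, "GET /gate.php?id=7"))    # two tokens present
print(signature_matches(tokens, supports, theta, "GET /index.html?id=7"))  # one token present
```

The same functions work on byte strings, since the `in` operator also performs substring search on `bytes`.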

Signature examples

Banload keylogger:

Storm worm:

Invariance as a main principle of ASG

Invariance is inherent in attacks due to the extremely specific nature of exploits.
Diversity makes signatures more general and accurate.
Too much diversity makes signatures smaller and leads to false positives.

Token extraction: basic definitions

A token is a substring found in malicious content that satisfies pre-defined empirical conditions, such as:

minimal length
minimal support: the percentage of malicious events in which it occurs

A pair of tokens is said to be distinct if neither is a substring of the other.
A token s that is a substring of another token t is ignored unless it also satisfies the tokenization conditions in occurrences that are not part of t.

Distinct token examples

s1 = “ddddfddf”, s2 = “ddddedde”
Distinct tokens: “dddd”, “dd”

s1 = “dddfdddf”, s2 = “dddeddde”
Distinct tokens: “ddd”

s1 = “abcbabcbaba”, s2 = “abcbabxbaby”
Distinct tokens: “abcbab”, “ba”
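The examples above can be reproduced with a brute-force sketch. It is not the suffix-tree algorithm of the lecture: it enumerates all substrings, processes them from longest to shortest, and accepts a token only if enough strings contain an occurrence that is not covered by an occurrence of an already accepted longer token. The function names and this reading of "not part of t" are assumptions.

```python
def occurrences(token, s):
    """Start offsets of all (possibly overlapping) occurrences of token in s."""
    starts, pos = [], s.find(token)
    while pos != -1:
        starts.append(pos)
        pos = s.find(token, pos + 1)
    return starts


def distinct_tokens(strings, min_len=1, min_support=1.0):
    needed = min_support * len(strings)
    candidates = {s[i:j] for s in strings
                  for i in range(len(s))
                  for j in range(i + min_len, len(s) + 1)}
    covered = [[] for _ in strings]  # per string: (start, end) spans of accepted tokens
    result = []
    # Longest candidates first, so that shorter ones are checked against them.
    for token in sorted(candidates, key=lambda t: (-len(t), t)):
        occ = [occurrences(token, s) for s in strings]
        if sum(1 for starts in occ if starts) < needed:
            continue  # plain support condition not met
        # Count strings with at least one occurrence outside every accepted span.
        free = 0
        for starts, spans in zip(occ, covered):
            if any(not any(a <= p and p + len(token) <= b for a, b in spans)
                   for p in starts):
                free += 1
        if free >= needed:
            result.append(token)
            for starts, spans in zip(occ, covered):
                spans.extend((p, p + len(token)) for p in starts)
    return result


print(distinct_tokens(["ddddfddf", "ddddedde"]))        # ['dddd', 'dd']
print(distinct_tokens(["dddfdddf", "dddeddde"]))        # ['ddd']
print(distinct_tokens(["abcbabcbaba", "abcbabxbaby"]))  # ['abcbab', 'ba']
```

The enumeration is quadratic in the string length per input string, so this is only suitable for checking small examples by hand.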

Token extraction: basic algorithm

Traverse a generalized suffix tree (GST) from top to bottom.
For each node, output its path from the root if its depth is at least Lmin and the number of non-zero entries in its leaf count is at least νn.
Output the percentage of non-zero entries in its leaf count as the token support.
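A minimal sketch of this traversal is given below. To stay short it uses an uncompressed suffix trie instead of a real GST, so every substring is a node and memory grows quadratically with the input length. The node layout and function names are assumptions; the thresholds are treated as inclusive lower bounds.

```python
def build_suffix_trie(strings):
    """Insert every suffix of every string; each node records which strings reach it."""
    root = {"children": {}, "owners": set()}
    for index, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "owners": set()})
                node["owners"].add(index)
    return root


def extract_tokens(strings, min_len, min_support):
    """Return {token: support} for all paths that are long and frequent enough."""
    root = build_suffix_trie(strings)
    tokens = {}
    stack = [(root, "")]
    while stack:  # top-down traversal
        node, path = stack.pop()
        for ch, child in node["children"].items():
            support = len(child["owners"]) / len(strings)
            if support < min_support:
                continue  # descendants cannot have higher support, so prune
            label = path + ch
            if len(label) >= min_len:
                tokens[label] = support
            stack.append((child, label))
    return tokens


print(sorted(extract_tokens(["abbaa", "baaaa"], min_len=1, min_support=1.0)))
# ['a', 'aa', 'b', 'ba', 'baa']
```

The set of string indices stored at each node plays the role of the leaf count: its size is the number of non-zero entries.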

Token extraction example

Input: strings “abbaa” and “baaaa”, Lmin = 1, ν = 100%

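Output: “a”, “aa”, “b”, “ba”, “baa”

[Figure: generalized suffix tree of the two input strings, terminated by “#” and “$” respectively. Leaf counts (string 1, string 2) at the internal nodes: root (6, 6); “a” (3, 4); “b” (2, 1); “aa” (1, 3); “baa” (1, 1); “aaa” (0, 2).]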

Open issues in the basic algorithm

How can we define unique “end-of-string” markers for a full alphabet of byte values?

Escaping
Special encoding: extra bytes (bits)

How can we avoid generation of non-distinct tokens?

Post-processing
Complex suffix tree accounting
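One way to read the "special encoding" option is to widen the alphabet: bytes become integers in the range 0..255, and each string receives its own terminator above that range. The sketch below illustrates this idea only; the function name and the choice of 256 + index as marker are assumptions.

```python
def encode_with_terminators(samples):
    """Map each byte string to a list of ints and append a unique end marker."""
    encoded = []
    for index, sample in enumerate(samples):
        symbols = list(sample)       # iterating over bytes yields ints in 0..255
        symbols.append(256 + index)  # marker outside the byte range, unique per string
        encoded.append(symbols)
    return encoded


print(encode_with_terminators([b"\x00\xff", b"a"]))
# [[0, 255, 256], [97, 257]]
```

Because no marker can collide with payload data, no escaping of the input is needed; the cost is that the tree must store symbols wider than one byte.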

Auxiliary node data

Let N be a given internal node of a GST and LN be the label path from the root to N.

Leaf index (LI) is the set of suffixes that are descendants of a given node (characterized by their positions in a string).
Prefix leaf index (PLI) is the set of suffixes containing distinct tokens for which LN is a prefix.
Suffix leaf index (SLI) is the set of suffixes containing distinct tokens for which LN is a suffix.

Exemplary GST with leaf indices

Input: strings “abbaa” and “baaaa”

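[Figure: generalized suffix tree of the two input strings, terminated by “#” and “$” respectively. Leaf indices at the internal nodes, written as [positions in string 1 | positions in string 2]:]

“a”: [1, 4, 5 | 2, 3, 4, 5]
“b”: [2, 3 | 1]
“aa”: [4 | 2, 3, 4]
“baa”: [3 | 1]
“aaa”: [∅ | 2, 3]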
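The leaf index of a node can be checked without building the tree: it is the set of 1-based positions at which the node's label path starts in each input string. The helper below is a brute-force sketch with a hypothetical name; a GST would collect the same positions from the leaves below the node.

```python
def leaf_index(label, strings):
    """Per string, the 1-based start positions of all occurrences of label."""
    index = []
    for s in strings:
        positions = {p + 1
                     for p in range(len(s) - len(label) + 1)
                     if s.startswith(label, p)}
        index.append(positions)
    return index


strings = ["abbaa", "baaaa"]
for label in ["a", "b", "aa", "baa", "aaa"]:
    print(label, leaf_index(label, strings))
```

An empty set for one string corresponds to the ∅ entry in the example.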

Sufficient condition for token distinctness

Theorem
A label path LN is a distinct token if at least νn components in the set LI \ (PLI ∪ SLI) are non-empty.

Token extraction with distinctness check

Traverse the suffix tree in the reverse label depth order, performing the following steps for each node:

1. Compute B = PLI ∪ SLI.
2. Compute T = LI \ B. If at least νn components of T are non-empty:
   set B = B ∪ T
   report a distinct token
3. Propagate B to the PLI of the parent node.
4. Propagate B, shifted by one, to the SLI of the suffix link node (if one exists).
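Steps 1 and 2 amount to component-wise set operations, one component per input string. The sketch below assumes that LI, PLI and SLI are stored as lists of position sets; the function name and the sample values are hypothetical, and the propagation in steps 3 and 4 is not shown.

```python
def distinctness_step(li, pli, sli, min_count):
    """Return (is_distinct, B) for one node, following steps 1 and 2."""
    # Step 1: B = PLI ∪ SLI, taken separately for each input string.
    b = [p | s for p, s in zip(pli, sli)]
    # Step 2: T = LI \ B holds occurrences not explained by longer distinct tokens.
    t = [leaves - covered for leaves, covered in zip(li, b)]
    is_distinct = sum(1 for component in t if component) >= min_count
    if is_distinct:
        b = [covered | free for covered, free in zip(b, t)]
    return is_distinct, b


# Illustrative values for two input strings and a required count of 2.
print(distinctness_step([{1, 4, 5}, {2, 3, 4, 5}], [{4}, {2, 3, 4}], [set(), set()], 2))
print(distinctness_step([{4}, {2, 3, 4}], [{4}, set()], [set(), set()], 2))
```

In the first call both components of T are non-empty, so the node is reported; in the second call only one is, so B is passed on unchanged.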

Signature assembly

Goal: remove tokens that frequently occur in normal traffic.

Rules for removal:

ν−(ti) > ν+(ti)
ν−(ti) > νmax

Underlying problem: set matching.

Algorithms:

Knuth-Morris-Pratt: O(k(n + M))
Aho-Corasick: O(k + n + M)
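The removal rules can be sketched as a filter over the extracted tokens. The sketch assumes that ν+ is the support of a token in malicious data and ν− its support in normal traffic, which the slides do not state explicitly. For brevity it matches each token with Python's built-in substring search rather than Aho-Corasick; all names and sample data are hypothetical.

```python
def support(token, samples):
    """Fraction of samples that contain the token."""
    return sum(1 for s in samples if token in s) / len(samples)


def filter_tokens(tokens, benign, nu_max):
    """Keep tokens that are neither more frequent in benign data nor above nu_max."""
    kept = {}
    for token, nu_pos in tokens.items():
        nu_neg = support(token, benign)
        if nu_neg > nu_pos or nu_neg > nu_max:
            continue  # the token is too common in normal traffic
        kept[token] = nu_pos
    return kept


# Hypothetical tokens with their support in malicious traffic.
tokens = {"cmd=": 0.9, "HTTP/1.1": 1.0, "id=": 0.8}
benign = ["GET / HTTP/1.1", "GET /a?id=1 HTTP/1.1",
          "POST /b HTTP/1.1", "GET /c HTTP/1.1"]
print(filter_tokens(tokens, benign, nu_max=0.2))  # {'cmd=': 0.9}
```

With many tokens and long traffic traces, a multi-pattern matcher would replace the inner loop, since each benign sample is then scanned only once.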

Signature refinement

Given the set of token/support pairs {(t1, ν1), ..., (tk, νk)}, signature refinement consists of the following steps:

Normalization: support values are normalized so that they add up to 1:

$$\nu_i \leftarrow \frac{\nu_i}{\sum_{j=1}^{k} \nu_j}$$

Calibration: the threshold θ is calibrated on benign data so as not to exceed some maximal false positive rate.
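Both refinement steps can be sketched as follows. The calibration strategy shown here, which picks the smallest benign score that keeps the number of flagged benign samples within the allowed budget, is one possible reading and not necessarily the procedure used in the lecture. All names and sample values are hypothetical.

```python
def normalize(supports):
    """Scale the support values so that they add up to 1."""
    total = sum(supports)
    return [nu / total for nu in supports]


def score(tokens, weights, s):
    """Weighted count of the tokens present in s."""
    return sum(w for t, w in zip(tokens, weights) if t in s)


def calibrate_threshold(tokens, weights, benign, max_fp_rate):
    """Choose theta so that at most max_fp_rate of the benign samples score above it."""
    scores = sorted(score(tokens, weights, s) for s in benign)
    allowed = min(int(max_fp_rate * len(scores)), len(scores) - 1)
    # Only the `allowed` highest-scoring samples can lie strictly above this value.
    return scores[len(scores) - 1 - allowed]


tokens = ["GET /x", "cmd="]
weights = normalize([3.0, 1.0])
benign = ["foo", "cmd=1", "bar", "GET /x cmd="]
theta = calibrate_threshold(tokens, weights, benign, max_fp_rate=0.25)
flagged = sum(1 for s in benign if score(tokens, weights, s) > theta)
print(weights, theta, flagged)
```

Because the evaluation rule uses a strict inequality, benign samples whose score equals θ are not flagged.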

Lessons learned

Automatic signature generation enables one to quickly extract signatures from samples of malicious and benign traffic.
Careful choice of algorithms and data structures is important for the practical feasibility of ASG.
ASG enables some very interesting applications to malware analysis, especially detection of malware communication.

Recommended reading

D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.

Konrad Rieck, Guido Schwenk, Tobias Limmer, Thorsten Holz, and Pavel Laskov. Botzilla: Detecting the "phoning home" of malicious software. In Proc. of 25th ACM Symposium on Applied Computing (SAC), March 2010 (to appear).