Data Science Driven Malware Detection

© Copyright 2015 Pivotal. All rights reserved.

Data Science Driven Malware Detection: Malicious Domain Association

Anirudh Kondaveeti, PhD, Principal Data Scientist


Project Goal

Goal: Find domains that have time- and user-based co-occurrence relationships to aid the detection of coordinated network attacks.

Example: Domain A is a watering hole. It redirects users to an exploit kit at Domain B within a short time window.

– B is relatively unknown: Visiting B is a low frequency (low support) event.

– B is almost always reached by redirection from A: The conditional probability (confidence) of an initial visit to A is high given that B is visited later on.

Attack flow: User visits watering hole domain A -> Watering hole domain A redirects to domain B -> Domain B hosts exploit kit -> User machine compromised


Data Sources & Preprocessing

Historical Proxy Logs
– Information about “who is accessing which website at what time”
– Approx. 3 months of data with billions of connection records

Local Domain White List – List of non-malicious websites

Preprocessing steps:

– Host Name Normalization (e.g. anirudh.facebook.com -> facebook.com)
– Filter Invalid Host Names (e.g. www.facebook,ca)
– Identify “unpopular” domains (e.g. www.francelegal.com)
– User-Specific Sessionization
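The first two preprocessing steps can be sketched in Python as follows. This is a minimal illustration, not the pipeline used in the project: the function names are hypothetical, and collapsing a host to its last two labels is a naive assumption (a production system would more likely consult the Public Suffix List so that names such as example.co.uk are handled correctly).

```python
import re

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen.
LABEL_RE = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")


def is_valid_host(host: str) -> bool:
    """Return True if host looks like a syntactically valid DNS name."""
    host = host.strip().lower().rstrip(".")
    labels = host.split(".")
    if len(host) > 253 or len(labels) < 2:
        return False
    return all(LABEL_RE.match(label) for label in labels)


def normalize_host(host: str) -> str:
    """Collapse a host name to its last two labels (naive assumption)."""
    labels = host.strip().lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


print(normalize_host("anirudh.facebook.com"))
print(is_valid_host("www.facebook,ca"))
```

The comma in www.facebook,ca makes the second label fail the pattern, so the host is rejected before normalization.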


User-Specific Sessionization

Each user’s proxy logs are sessionized so that two consecutive connections in the same session occur within a user-specified time window (e.g. 60s).

Sequential patterns are derived from sessionized data.

Connection Time       Domain                Session ID
2015-07-03 12:41:08   googlevideo.com       1
2015-07-03 12:41:09   twitter.com           1
2015-07-03 12:41:12   youtube.com           1
2015-07-03 12:41:14   doubleclick.net       1
2015-07-03 12:41:15   google.com            1
2015-07-03 12:41:15   googleanalytics.com   1
2015-07-03 12:41:28   youtube.com           1
2015-07-03 12:59:23   facebook.com          2
2015-07-03 12:59:24   yahoo.com             2

The connection at 12:59:23 is >60s after the previous one, so it starts a new session.
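A minimal Python sketch of the sessionization rule described above, applied to one user's log. The function name and the input layout (a list of timestamp/domain tuples) are assumptions made for illustration; the project itself implemented this step in SQL.

```python
from datetime import datetime, timedelta


def sessionize(rows, gap_seconds=60):
    """Assign session IDs to one user's (timestamp, domain) rows.

    A new session starts whenever the gap to the previous connection
    exceeds gap_seconds.
    """
    gap = timedelta(seconds=gap_seconds)
    out = []
    session_id = 0
    prev_time = None
    for ts, domain in sorted(rows):
        if prev_time is None or ts - prev_time > gap:
            session_id += 1
        out.append((ts, domain, session_id))
        prev_time = ts
    return out


def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")


log = [
    (parse("2015-07-03 12:41:08"), "googlevideo.com"),
    (parse("2015-07-03 12:41:09"), "twitter.com"),
    (parse("2015-07-03 12:41:28"), "youtube.com"),
    (parse("2015-07-03 12:59:23"), "facebook.com"),
    (parse("2015-07-03 12:59:24"), "yahoo.com"),
]
sessions = sessionize(log)
for ts, domain, sid in sessions:
    print(ts, domain, sid)
```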


Modeling Approaches

Sequential Pattern Mining
– Find time-ordered co-occurrence relationships between multiple domains.
– Output low frequency, high confidence sequences of domains: [{Domain1},{Domain2, Domain3},…] => [DomainN].

Graph Mining
– Build a “social network” graph between domains by creating edges between pairs of domains that are associated with high confidence.
– Use graph-based algorithms to find fully and partially connected subgraphs.

The two approaches can be used in conjunction to complement each other.


Modeling Framework Design Considerations

Operational feasibility
– Incremental data processing and modeling on incoming new data, e.g. on a weekly basis, to distribute workload over time.
– Results are updated to incorporate new model outputs.

Computational tractability
– Implement most of the modeling frameworks in plain SQL, and design efficient window functions to achieve better runtime performance.
– Explicit PL/R routine parallelization to leverage the Massively Parallel Processing architecture of the Greenplum database.


An Incremental Modeling Framework

Initial Run: Initial Proxy Logs & Domain Whitelist -> Preprocessed Proxy Logs (host normalization & validation, data filtering, sessionization) -> Model Execution (Sequential Pattern Mining, Graph Mining) -> Model-Specific Results

Update: New Proxy Logs & (Possibly) Updated Domain Whitelist -> Preprocessed New Proxy Logs (host normalization & validation, data filtering, sessionization) -> Model Update (Sequential Pattern Mining, Graph Mining) -> Updated Model-Specific Results


Modeling Approaches: Sequential Pattern Mining


Model Execution: Sequential Pattern Mining

Create time-ordered domain sequences from sessionized data

Given a list of targeted domains (e.g. rare domains), select the subset of sequences containing those domains

Find high confidence, low support sequential patterns of targeted domains in parallel


Sequence Creation

Each sequence contains domains in a session by the same user.

Domains are ordered by connection time.

Sequences for the example below:
– Sequence 1: [{googlevideo.com}, {twitter.com}, {youtube.com}, {doubleclick.net}, {google.com}, {googleanalytics.com}]
– Sequence 2: [{facebook.com}, {yahoo.com}]

Connection Time       Domain                Session ID
2015-01-06 14:41:08   googlevideo.com       1
2015-01-06 14:41:09   twitter.com           1
2015-01-06 14:41:12   youtube.com           1
2015-01-06 14:41:14   doubleclick.net       1
2015-01-06 14:41:15   google.com            1
2015-01-06 14:41:15   googleanalytics.com   1
2015-01-06 14:59:23   facebook.com          2
2015-01-06 14:59:24   yahoo.com             2
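Sequence creation from sessionized rows might look like the sketch below. The row layout (user, session ID, timestamp, domain), the user name "u1" and the function name are assumptions for illustration; each connection becomes a one-domain itemset, mirroring the two example sequences.

```python
from itertools import groupby


def build_sequences(rows):
    """Group (user, session_id, timestamp, domain) rows into sequences.

    Returns {(user, session_id): [ {domain}, {domain}, ... ]} with the
    itemsets ordered by connection time.
    """
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
    sequences = {}
    for key, group in groupby(ordered, key=lambda r: (r[0], r[1])):
        sequences[key] = [{domain} for _, _, _, domain in group]
    return sequences


rows = [
    ("u1", 1, "2015-01-06 14:41:08", "googlevideo.com"),
    ("u1", 1, "2015-01-06 14:41:09", "twitter.com"),
    ("u1", 1, "2015-01-06 14:41:12", "youtube.com"),
    ("u1", 1, "2015-01-06 14:41:14", "doubleclick.net"),
    ("u1", 1, "2015-01-06 14:41:15", "google.com"),
    ("u1", 1, "2015-01-06 14:41:15", "googleanalytics.com"),
    ("u1", 2, "2015-01-06 14:59:23", "facebook.com"),
    ("u1", 2, "2015-01-06 14:59:24", "yahoo.com"),
]
sequences = build_sequences(rows)
print(sequences[("u1", 2)])
```

Timestamps are kept as ISO-style strings here because they sort correctly as text; the sort is stable, so connections with identical timestamps keep their input order.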


Sequence Statistics

sup: Support of a pattern P is the fraction of sequences in which the pattern occurs
– sup({a,e}) = 2/10

conf: Confidence of a rule X => Y is the proportion of sequences containing X that also contain Y
– conf({a => e}) = sup({a,e})/sup({a}) = 2/5

#users: Number of distinct users for which a pattern P occurs
– #users({a}) = 1

sup and #users follow the monotone property, i.e. for {a,e} ⊇ {a}:
– sup({a,e}) ≤ sup({a})
– #users({a,e}) ≤ #users({a})

(The example values are computed on 10 sequences from a single user.)
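The three statistics can be computed with a few lines of Python. The toy data below is hypothetical (it is not the figure from the slide) and was chosen only so that the numbers match the example values: 10 sequences from one user, 5 containing a, and 2 in which e follows a. Treating a pattern as an ordered subsequence is also an assumption of this sketch.

```python
def contains(sequence, pattern):
    """True if pattern occurs in sequence as an ordered subsequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(sequences, pattern):
    """Fraction of sequences that contain the pattern."""
    hits = sum(contains(seq, pattern) for _, seq in sequences)
    return hits / len(sequences)


def confidence(sequences, lhs, rhs):
    """sup(lhs followed by rhs) / sup(lhs); lhs must occur at least once."""
    return support(sequences, lhs + rhs) / support(sequences, lhs)


def num_users(sequences, pattern):
    """Number of distinct users with at least one matching sequence."""
    return len({user for user, seq in sequences if contains(seq, pattern)})


# Hypothetical toy data: each string is one sequence of single-letter domains.
sequences = [("u1", list(s)) for s in
             ["abe", "ace", "ab", "ad", "a", "bc", "cd", "be", "d", "ce"]]

print(support(sequences, ["a", "e"]))
print(confidence(sequences, ["a"], ["e"]))
print(num_users(sequences, ["a"]))
```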


Sequential Pattern Mining (SPM) in Parallel

Developed a scalable algorithm in Greenplum database (GPDB) to identify low support, high confidence patterns occurring in a minimum number of user sequences.

High confidence patterns relating to a given set of domains are obtained in parallel, i.e., SPM runs independently on different subsets of sequences for different domains.

SELECT a_targeted_domain,
       sequential_pattern_mining(min_support, min_confidence, min_num_users)
FROM input_table

Pseudo code:

– Find domain A with small support (or known bad domain)
– Subset sequences from data containing A
– Find sequential patterns of A with high confidence
– Repeat for all A in parallel on separate GPDB nodes
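The pseudo code above can be imitated on a single machine as in the sketch below. It is a deliberately simplified stand-in, not the GPDB implementation: it mines only single-antecedent rules X => A, omits the min_support and min_num_users filters, and uses a thread pool in place of database segments. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def mine_rules_for(target, sequences, min_confidence):
    """Mine rules X => target inside the subset of sequences containing target."""
    subset = [seq for seq in sequences if target in seq]
    lhs_count = {}   # subset sequences that contain X
    both_count = {}  # subset sequences where X occurs before the target
    for seq in subset:
        last = len(seq) - 1 - seq[::-1].index(target)  # last visit to target
        for x in set(seq) - {target}:
            lhs_count[x] = lhs_count.get(x, 0) + 1
            if x in seq[:last]:
                both_count[x] = both_count.get(x, 0) + 1
    rules = []
    for x, n in lhs_count.items():
        conf = both_count.get(x, 0) / n
        if conf >= min_confidence:
            rules.append((x, target, conf))
    return sorted(rules)


def mine_all(targets, sequences, min_confidence=0.5, workers=4):
    """Run the per-domain mining independently for every target domain."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda t: mine_rules_for(t, sequences, min_confidence), targets)
        return dict(zip(targets, results))


sequences = [["a", "b", "x"], ["a", "x"], ["x", "b"], ["c", "d"]]
print(mine_all(["x"], sequences, min_confidence=0.6))
```

Because each target domain only touches its own subset of sequences, the per-domain jobs share no state, which is what makes the work easy to distribute.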


Relative Confidence to Adjust Ranking of Patterns

For each domain of interest, SPM is run only on the subset of sequences containing that domain. This may cause some sequential patterns to have artificially high confidence.

Recall: confidence(X => Y) := support(<X,Y>) / support(X) = |<X,Y>| / |X|. Here |X|, the number of sequences in the subset that contain the left hand side pattern, may not reflect the popularity of X in the full dataset.

We define relative confidence as: relative_confidence(X => Y) := |<X,Y>| / |X|_fullset

where |X|_fullset is the number of sequences in the full dataset that contain the left hand side pattern.

Relative confidence favors patterns whose left hand side contains less popular domains (see the example table below).

Relative confidence favors unpopular left hand side patterns:

Domain            Pattern                                                   Supp   Conf  Rel Conf
revenueindia.net  <{google.com},{facebook.com}> => <{revenueindia.net}>     0.079  0.75  0.0001
revenueindia.net  <{google.com},{fileshare.com}> => <{revenueindia.net}>    0.071  0.75  0.067
revenueindia.net  <{fileshare.com},{redworm.com}> => <{revenueindia.net}>   0.030  1.00  0.51
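To make the effect on ranking concrete, here is a small Python sketch with hypothetical counts and placeholder domain names (none of these numbers come from the table above). Two rules have the same confidence inside the subset, but very different relative confidence once the left hand side is counted over the full dataset.

```python
def subset_confidence(n_both, n_lhs_subset):
    """|<X,Y>| / |X| measured inside the per-domain subset."""
    return n_both / n_lhs_subset


def relative_confidence(n_both, n_lhs_fullset):
    """|<X,Y>| / |X|_fullset, with |X| counted over the full dataset."""
    return n_both / n_lhs_fullset


# (rule, |<X,Y>|, |X| in subset, |X| in full dataset); all values hypothetical.
rules = [
    ("<{popular.example}> => <{rare.example}>", 3, 4, 30000),
    ("<{obscure.example}> => <{rare.example}>", 3, 4, 6),
]

ranked = sorted(rules, key=lambda r: relative_confidence(r[1], r[3]), reverse=True)
for rule, n_both, n_sub, n_full in ranked:
    print(rule, subset_confidence(n_both, n_sub), relative_confidence(n_both, n_full))
```

Both rules have a subset confidence of 0.75, yet the rule with the obscure left hand side ranks first because its antecedent rarely occurs without the rare domain.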


Model Update: Sequential Pattern Mining

The model update module for sequential pattern mining follows a workflow similar to that of the model execution module.

The one additional step is to merge the new results obtained from the incoming data with the existing set of patterns, including updating the rule quality metrics (support, confidence, etc.).

Create time-ordered domain sequences from new sessionized data

Given a list of targeted domains (e.g. rare domains), select the subset of sequences containing those domains

Find high confidence, low support sequential patterns of targeted domains in parallel

Merge new results with the existing set of patterns.
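One way the merge step could work is sketched below, under the assumption that raw counts (rather than ratios) are stored for every pattern so that the metrics can be recomputed after each batch. The dictionary layout and key names are hypothetical. Note that #users cannot be merged by simple addition, since the same user may appear in both batches; that would require keeping the set of users per pattern.

```python
def merge_patterns(existing, new):
    """Merge per-pattern counts from a new batch into the existing set.

    Each value holds raw counts: 'both' = |<X,Y>| and 'lhs' = |X|.
    Confidence is recomputed from the merged counts.
    """
    merged = {pattern: dict(counts) for pattern, counts in existing.items()}
    for pattern, counts in new.items():
        slot = merged.setdefault(pattern, {"both": 0, "lhs": 0})
        slot["both"] += counts["both"]
        slot["lhs"] += counts["lhs"]
    for counts in merged.values():
        counts["confidence"] = (
            counts["both"] / counts["lhs"] if counts["lhs"] else 0.0)
    return merged


existing = {("a", "b"): {"both": 2, "lhs": 4}}
new_batch = {("a", "b"): {"both": 1, "lhs": 1},
             ("c", "d"): {"both": 1, "lhs": 2}}
merged = merge_patterns(existing, new_batch)
print(merged)
```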


Modeling Approaches: Graph Mining


Model Execution: Graph Mining

Construct “baskets” of domains (co-occurring domains) by running a sliding window of a certain time interval through the data

Find high confidence, low support pairwise association rules of the form Domain 1 => Domain 2

Create social network of domains

Find partially and fully connected sub-graphs


Construction of “Baskets”

Domains visited by a user in a certain time window form a “basket”, analogous to items purchased in a single transaction as in market basket analysis.

The time interval for the sliding window (60s window used in the implementation) can be tuned.

A basket contains the distinct domains in a sliding window. Example (Basket 1 covers the first 60s window, 14:41:00 to 14:42:00):

Basket 1 = {googlevideo.com, twitter.com, youtube.com, doubleclick.net, google.com}

Connection Time       Domain
2015-01-06 14:41:00   googlevideo.com
2015-01-06 14:41:09   twitter.com
2015-01-06 14:41:12   youtube.com
2015-01-06 14:41:14   doubleclick.net
2015-01-06 14:42:00   google.com
2015-01-06 14:42:05   googleanalytics.com
2015-01-06 14:42:08   pivotal.io
2015-01-06 14:59:23   facebook.com
2015-01-06 14:59:24   yahoo.com
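Basket construction for one user can be sketched as follows. The slides do not say how far the window advances between baskets, so this sketch assumes one window per connection, starting at that connection and covering the next 60 seconds inclusive; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def build_baskets(rows, window_seconds=60):
    """Return one basket per connection for a single user's rows.

    Each basket holds the distinct domains seen from that connection up to
    window_seconds later (inclusive).
    """
    rows = sorted(rows)
    window = timedelta(seconds=window_seconds)
    baskets = []
    for i, (start, _) in enumerate(rows):
        basket = set()
        for ts, domain in rows[i:]:
            if ts - start > window:
                break
            basket.add(domain)
        baskets.append(basket)
    return baskets


def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")


log = [
    (parse("2015-01-06 14:41:00"), "googlevideo.com"),
    (parse("2015-01-06 14:41:09"), "twitter.com"),
    (parse("2015-01-06 14:41:12"), "youtube.com"),
    (parse("2015-01-06 14:41:14"), "doubleclick.net"),
    (parse("2015-01-06 14:42:00"), "google.com"),
    (parse("2015-01-06 14:42:05"), "googleanalytics.com"),
    (parse("2015-01-06 14:42:08"), "pivotal.io"),
    (parse("2015-01-06 14:59:23"), "facebook.com"),
    (parse("2015-01-06 14:59:24"), "yahoo.com"),
]
baskets = build_baskets(log)
print(sorted(baskets[0]))
```

The nested loop is quadratic in the worst case; it is kept here for readability, whereas a window function in SQL would be the natural choice at scale.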


Pairwise Association Rule Mining

Given domain-to-basket assignments, pairwise association rule mining mainly involves evaluation of:
– Co-occurrence frequency: the number of times two domains fall in a common basket.
– Conditional probability: the probability of seeing domain 2 given that domain 1 is present.

Pairwise rule mining is implemented in plain SQL in a scalable fashion.

Domain A    Domain B            # {A,B}  # A  # B  P(A|B)    P(B|A)    # A to B  # B to A  # AB Same Time  Max(# User Names/M)  # Date  Min Date    Max Date
pivotal.io  montecarlo.com      10       560  10   1.000000  0.017857  9         0         1               1                    1       2015-02-26  2015-02-26
pivotal.io  bigbangtheory.com   25       560  26   0.961538  0.044643  21        4         0               2                    1       2015-02-23  2015-02-23
pivotal.io  sciencefiction.com  78       560  97   0.804124  0.139286  61        15        2               4                    8       2015-01-23  2015-02-17

High confidence (>0.5) associations involving multiple users over several days (e.g. highlighted rules) are generally more interesting.
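The two core quantities, co-occurrence frequency and conditional probability, can be computed from baskets as in this Python sketch. It is an illustration only (the project used plain SQL), the function and key names are hypothetical, and the direction, user and date columns of the table are not reproduced.

```python
from collections import Counter
from itertools import combinations


def pairwise_rules(baskets):
    """Count co-occurrences and estimate P(B|A) and P(A|B) for each pair.

    baskets is an iterable of sets of domains; pairs are keyed (a, b)
    with a < b alphabetically.
    """
    single = Counter()
    pair = Counter()
    for basket in baskets:
        single.update(basket)
        pair.update(combinations(sorted(basket), 2))
    rules = {}
    for (a, b), n_ab in pair.items():
        rules[(a, b)] = {
            "n_ab": n_ab,
            "p_b_given_a": n_ab / single[a],
            "p_a_given_b": n_ab / single[b],
        }
    return rules


baskets = [{"a", "b"}, {"a", "b"}, {"a"}, {"a", "c"}]
rules = pairwise_rules(baskets)
print(rules[("a", "b")])
```

In this toy input, b never appears without a, so P(A|B) is 1.0 while P(B|A) is only 0.5, the same asymmetry seen in the first table row.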


Exploring Interactions between Domains

To explore the interactions between domains, we build an undirected correlation graph using the discovered pairwise domain association rules.

Each node in the graph is a domain. An edge connects two domains if their co-occurrence confidence is higher than a threshold (e.g. 0.2).

The example on the right shows the tightly connected “social network” of a particular domain.

Partially and fully connected networks indicate possible watering hole or bot-net attacks.

Question: How to quantify the connectivity of a network?

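[Figure: example “social network” of a domain, with nodes abc.com, xyz.com, hga.com and hebf.com. Each node denotes a domain; the weight of each edge denotes the confidence (edge weights shown: 0.25, 0.37, 0.71, 0.52, 0.1, 0.6, 0.1).]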


OddBall Metrics for Graph Anomaly Detection

We take the OddBall approach* to quantify the connectivity of each domain’s network:
– Identify each domain’s one-step neighborhood (also called ego-net).
– Extract two graph features from the ego-net:
  ▪ N: Number of neighbors
  ▪ E: Number of edges in the ego-net

The number of neighbors and the number of edges follow a power law: E ∝ N^α, 1 ≤ α ≤ 2

* OddBall: Spotting Anomalies in Weighted Graphs, Leman Akoglu et al., PAKDD, Hyderabad, India, June 2010.

Picture Source: ICDM’12 tutorial on graph anomaly detection

• Use log(E)/log(N) to approximate the slope. log(E)/log(N) > 1 indicates some degree of connectivity among neighbors.

• The higher the ratio, the higher the degree of connectivity (given the same number of neighbors). Generally, an OddBall ratio > 1.5 is more interesting.

• One can additionally compute the clique percentage, the ratio between E and the number of edges needed to form a clique, E/[(N^2+N)/2], to measure network connectivity.
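The two ego-net features and the derived metrics can be computed as in the sketch below. The graph representation (a dict mapping each node to its set of neighbors) and the function names are assumptions for illustration. The ratio is only defined for N ≥ 2 and E ≥ 1, since log(1) = 0.

```python
import math


def ego_net_features(graph, node):
    """Return (N, E) for node's ego-net in an undirected graph.

    graph maps each node to the set of its neighbors. E counts the edges
    among the node and its one-step neighbors.
    """
    neighbors = graph[node]
    members = neighbors | {node}
    edges = {frozenset((u, v))
             for u in members for v in graph[u] if v in members}
    return len(neighbors), len(edges)


def oddball_ratio(n, e):
    """log(E)/log(N); requires n >= 2 and e >= 1."""
    return math.log(e) / math.log(n)


def clique_percentage(n, e):
    """E divided by the edge count of a clique on the ego plus N neighbors."""
    return e / ((n * n + n) / 2)


# A 5-node clique: every domain is connected to every other domain.
nodes = ["a.com", "b.com", "c.com", "d.com", "e.com"]
graph = {x: set(nodes) - {x} for x in nodes}

n, e = ego_net_features(graph, "a.com")
print(n, e, round(oddball_ratio(n, e), 2), clique_percentage(n, e))
```

With 4 neighbors and 10 edges, the ratio rounds to 1.66 and the clique percentage is 100%.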


Sample Domains with Highly Connected Networks

The highlighted domain (a.com, clique percentage 100%) has a fully connected network, a clique!

Domain  # Neighbors  Neighbours                                                        # Edge  log(E)/log(N)  Clique Percent  # User Names
a.com   4            {b.com, c.com, d.com, e.com}                                      10      1.66           100%            6
s.com   7            {a.com, b.com, c.com, d.com, e.com, f.com}                        27      1.69           96%             9
r.com   9            {a.com, b.com, c.com, d.com, e.com, f.com, g.com, h.com, i.com}   43      1.71           96%             7
abc.ru  9            {a.com, b.com, c.com, d.com, e.com, f.com, g.com, h.com, i.com}   42      1.70           93%             11

[Figure: ego-net of a.com with neighbors b.com, c.com, d.com and e.com]


Detecting Isolated Clusters

Given the domain correlation graph, one can also identify isolated groups of domains that only interact with domains in the same group, but not others (a bot-net like structure).

This can be formulated as the task of finding connected components (CCs) in a graph.

The example below shows that malicious sites tend to exist in small CCs.

[Figure: sample connected component with nodes qre.com, jekc.com, fbc.com, abc.com, ghk.com and bcd.com, one of which is marked as a known malicious site]
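Finding connected components in the domain correlation graph can be sketched with a breadth-first search. The edge list format (domain, domain, confidence), the 0.2 threshold taken from the earlier example, and the function names are assumptions for illustration. Domains with no edge above the threshold do not appear in the graph at all.

```python
from collections import defaultdict, deque


def build_graph(edges, threshold=0.2):
    """Build an undirected graph, keeping edges with confidence > threshold."""
    graph = defaultdict(set)
    for a, b, conf in edges:
        if conf > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return the connected components of the graph as a list of sets."""
    seen = set()
    components = []
    for start in list(graph):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


# Hypothetical edges: the a-d link falls below the threshold and is dropped.
edges = [("a", "b", 0.7), ("b", "c", 0.5), ("d", "e", 0.9), ("a", "d", 0.1)]
components = connected_components(build_graph(edges))
print(sorted(sorted(c) for c in components))
```

Small components can then be reviewed first, for example by sorting the result by component size.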


Operationalization and Outlook


Operationalization Vision

Run Algorithms
• Owned by Data Engineer/Data Scientist
• Incrementally (e.g. weekly) update models using new batches of data, e.g. as a cron job

Inspect Anomalies
• Owned by security team
• Ideally model outputs provided via interactive web dashboards

Evaluate Model Outputs
• Feedback on model performance from security team
• Opportunities for refinement and ideas for new models

Refine Algorithms
• Owned by Data Scientist
• Refine algorithms

Load New Data
• Owned by Data Engineer
• Load new data
