Data Science Driven Malware Detection
© Copyright 2015 Pivotal. All rights reserved.
Malicious Domain Association
Anirudh Kondaveeti, PhD
Principal Data Scientist
Project Goal
Goal: Find domains that have time- and user-based co-occurrence relationships, to aid the detection of coordinated network attacks.
Example: Domain A is a watering hole. It redirects users to an exploit kit at Domain B within a short time window.
– B is relatively unknown: visiting B is a low-frequency (low-support) event.
– B is almost always reached by redirection from A: the conditional probability (confidence) of an initial visit to A, given that B is visited later on, is high.
Figure (attack flow): User visits watering hole domain A -> Watering hole domain A redirects to domain B -> Domain B hosts exploit kit -> User machine compromised
Data Sources & Preprocessing
Historical Proxy Logs
– Information about “who is accessing which website at what time”
– Approx. 3 months of data with billions of connection records
Local Domain Whitelist
– List of non-malicious websites
Preprocessing
– Host Name Normalization (e.g. anirudh.facebook.com -> facebook.com)
– Filter Invalid Host Names (e.g. www.facebook,ca)
– Identify “unpopular” domains (e.g. www.francelegal.com)
– User-Specific Sessionization
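The first two preprocessing steps could be sketched as follows. This is only an illustration, not the implementation used in the project: the function names are hypothetical, the validity rule (letters, digits and hyphens per label) is an assumption, and keeping the last two labels is a naive normalization that would mis-handle suffixes such as co.uk, where a public suffix list would be needed.

```python
import re

# Assumption: a valid label has 1-63 letters, digits or hyphens
# and does not start or end with a hyphen.
_LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")


def is_valid_host(host: str) -> bool:
    """Return True if every dot-separated label looks like a hostname label."""
    labels = host.strip().lower().rstrip(".").split(".")
    return len(labels) >= 2 and all(_LABEL.fullmatch(label) for label in labels)


def normalize_host(host: str) -> str:
    """Naive normalization: keep only the last two labels of the host name."""
    labels = host.strip().lower().rstrip(".").split(".")
    return ".".join(labels[-2:])
```

With these definitions, anirudh.facebook.com normalizes to facebook.com, and www.facebook,ca is rejected because of the comma.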
User-Specific Sessionization
Each user’s proxy logs are sessionized so that two consecutive connections in the same session occur within a user-specified time window (e.g. 60s).
Sequential patterns are derived from sessionized data.
Connection Time Domain Session ID
2015-07-03 12:41:08 googlevideo.com 1
2015-07-03 12:41:09 twitter.com 1
2015-07-03 12:41:12 youtube.com 1
2015-07-03 12:41:14 doubleclick.net 1
2015-07-03 12:41:15 google.com 1
2015-07-03 12:41:15 googleanalytics.com 1
2015-07-03 12:41:28 youtube.com 1
2015-07-03 12:59:23 facebook.com 2
2015-07-03 12:59:24 yahoo.com 2
Note: the 12:59:23 connection is more than 60s after the previous one, so it starts a new session.
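The sessionization rule can be sketched in a few lines. This is a minimal illustration for a single user's time-sorted connections; the function name and the 60-second default are assumptions taken from the example, and the project itself implements this step in SQL window functions rather than Python.

```python
from datetime import datetime, timedelta


def sessionize(times, gap_seconds=60):
    """Assign 1-based session IDs to one user's connection times.

    times must be sorted ascending. A new session starts whenever the gap
    to the previous connection exceeds gap_seconds.
    """
    gap = timedelta(seconds=gap_seconds)
    session_ids = []
    current = 0
    previous = None
    for t in times:
        if previous is None or t - previous > gap:
            current += 1
        session_ids.append(current)
        previous = t
    return session_ids
```

Applied to the nine connection times in the table, this yields session 1 for the first seven rows and session 2 for the last two.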
Modeling Approaches
Sequential Pattern Mining
– Find time-ordered co-occurrence relationships between multiple domains.
– Output low-frequency, high-confidence sequences of domains: [{Domain1}, {Domain2, Domain3}, …] => [DomainN].
Graph Mining
– Build a “social network” graph between domains by creating edges between pairs of domains that are associated with high confidence.
– Use graph-based algorithms to find fully and partially connected subgraphs.
The two approaches can be used in conjunction to complement each other.
Modeling Framework Design Considerations
Operational feasibility
– Incremental data processing and modeling on incoming new data, e.g. on a weekly basis, to distribute workload over time.
– Results are updated to incorporate new model outputs.
Computational tractability
– Implement most of the modeling frameworks in plain SQL, and design efficient window functions to achieve better runtime performance.
– Explicit PL/R routine parallelization to leverage the Massively Parallel Processing architecture of the Greenplum database.
An Incremental Modeling Framework
Initial Run
Initial Proxy Logs & Domain Whitelist
-> Preprocessed Proxy Logs
• Host normalization & validation
• Data filtering
• Sessionization
-> Model Execution:
• Sequential Pattern Mining
• Graph Mining
-> Model-Specific Results
Update
New Proxy Logs & (Possibly) Updated Domain Whitelist
-> Preprocessed New Proxy Logs
• Host normalization & validation
• Data filtering
• Sessionization
-> Model Update:
• Sequential Pattern Mining
• Graph Mining
-> Updated Model-Specific Results
Model Execution: Sequential Pattern Mining
Create time-ordered domain sequences from sessionized data
-> Given a list of targeted domains (e.g. rare domains), select the subset of sequences containing those domains
-> Find high-confidence, low-support sequential patterns of the targeted domains in parallel
Sequence Creation
Each sequence contains domains in a session by the same user.
Domains are ordered by connection time.
Sequences for the example below:
– Sequence 1: [{googlevideo.com}, {twitter.com}, {youtube.com}, {doubleclick.net}, {google.com}, {googleanalytics.com}]
– Sequence 2: [{facebook.com}, {yahoo.com}]
Connection Time Domain Session ID
2015-01-06 14:41:08 googlevideo.com 1
2015-01-06 14:41:09 twitter.com 1
2015-01-06 14:41:12 youtube.com 1
2015-01-06 14:41:14 doubleclick.net 1
2015-01-06 14:41:15 google.com 1
2015-01-06 14:41:15 googleanalytics.com 1
2015-01-06 14:59:23 facebook.com 2
2015-01-06 14:59:24 yahoo.com 2
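Turning sessionized records into sequences might look like the sketch below. The function name and the record layout (time, domain, session ID) are assumptions for illustration. Following the example, each connection becomes its own single-domain itemset, even when two connections share a timestamp.

```python
from collections import defaultdict


def build_sequences(records):
    """Group one user's records into time-ordered sequences, one per session.

    records: iterable of (connection_time, domain, session_id) tuples.
    Returns {session_id: [{domain}, {domain}, ...]}.
    """
    by_session = defaultdict(list)
    # sorted() is stable, so records with equal times keep their input order.
    for _, domain, session_id in sorted(records, key=lambda r: r[0]):
        by_session[session_id].append({domain})
    return dict(by_session)
```

Called on the eight rows of the table, this returns the two sequences listed above, keyed by session ID.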
Sequence Statistics
sup: Support of a pattern P is the fraction of sequences in which the pattern occurs.
– sup({a,e}) = 2/10
conf: Confidence of a rule X => Y is the proportion of sequences (transactions) containing X that also contain Y.
– conf({a} => {e}) = sup({a,e})/sup({a}) = (2/10)/(5/10) = 2/5
#users: Number of distinct users for which a pattern P occurs.
– #users({a}) = 1
sup and #users follow a monotone property: extending a pattern can never increase either value. Since {a,e} ⊇ {a}:
– sup({a,e}) ≤ sup({a})
– #users({a,e}) ≤ #users({a})
(The example values are based on 10 sequences from a single user.)
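The three statistics can be computed as in the sketch below. The toy data is made up to reproduce the numbers on this slide (the original 10 sequences are not in the transcript), the slide's {a,e} is read as the ordered pattern "a, then e", and all function names are hypothetical.

```python
def contains(sequence, pattern):
    """True if pattern (a list of itemsets) occurs in order within sequence."""
    position = 0
    for itemset in sequence:
        if position < len(pattern) and pattern[position] <= itemset:
            position += 1
    return position == len(pattern)


def count(data, pattern):
    """Number of (user, sequence) entries whose sequence contains pattern."""
    return sum(1 for _, sequence in data if contains(sequence, pattern))


def support(data, pattern):
    return count(data, pattern) / len(data)


def confidence(data, lhs, rhs):
    lhs_count = count(data, lhs)
    return count(data, lhs + rhs) / lhs_count if lhs_count else 0.0


def num_users(data, pattern):
    return len({user for user, sequence in data if contains(sequence, pattern)})


# Toy data (an assumption): 10 sequences from one user, "a" in 5, "a then e" in 2.
data = (
    [("u1", [{"a"}, {"e"}])] * 2
    + [("u1", [{"a"}, {"b"}])] * 3
    + [("u1", [{"c"}])] * 5
)
```

On this toy data, the support of "a, then e" is 2/10, the confidence of a => e is 2/5, and the pattern {a} is seen for one user.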
Sequential Pattern Mining (SPM) in Parallel
We developed a scalable algorithm in the Greenplum database (GPDB) to identify patterns with low support and high confidence that occur in at least a minimum number of user sequences.
High-confidence patterns relating to a given set of domains are obtained in parallel, i.e., SPM runs independently on different subsets of sequences for different domains.
SELECT a_targeted_domain,
       sequential_pattern_mining(min_support, min_confidence, min_num_users)
FROM input_table
Pseudo code:
– Find a domain A with small support (or a known bad domain).
– Subset the sequences in the data that contain A.
– Find sequential patterns of A with high confidence.
– Repeat for all A in parallel on separate GPDB nodes.
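The pseudo code can be illustrated with a much-simplified Python sketch. It is not the GPDB algorithm: it only mines rules with a single domain on the left-hand side, it uses a thread pool as a stand-in for GPDB nodes, and the names mine_domain and mine_all are hypothetical. Target domains are assumed to have been chosen already (rare or known bad), so no support filter is applied here.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def mine_domain(target, sequences, min_confidence=0.5, min_num_users=1):
    """Mine rules [{x}] => [{target}] from the sequences that contain target.

    sequences: list of (user, [domain, ...]) with domains in time order.
    Returns a sorted list of (x, confidence, number_of_users) tuples.
    """
    subset = [(user, seq) for user, seq in sequences if target in seq]
    lhs_count = Counter()   # subset sequences that contain x
    rule_count = Counter()  # subset sequences where x precedes target
    rule_users = {}         # x -> users for which x precedes target
    for user, seq in subset:
        last_target = len(seq) - 1 - seq[::-1].index(target)
        for x in set(seq) - {target}:
            lhs_count[x] += 1
            if seq.index(x) < last_target:
                rule_count[x] += 1
                rule_users.setdefault(x, set()).add(user)
    rules = []
    for x, n in rule_count.items():
        conf = n / lhs_count[x]
        if conf >= min_confidence and len(rule_users[x]) >= min_num_users:
            rules.append((x, conf, len(rule_users[x])))
    return sorted(rules)


def mine_all(targets, sequences, **params):
    """Run mine_domain independently for each target domain."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda t: mine_domain(t, sequences, **params), targets))
    return dict(zip(targets, results))
```

Because each target only touches its own subset of sequences, the per-domain runs share no state, which is what makes the independent, parallel execution described above possible.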
Relative Confidence to Adjust Ranking of Patterns
For each domain of interest, SPM is run only on the subset of sequences containing that domain. This may cause some sequential patterns to have artificially high confidence.
Recall: confidence(X => Y) := support(<X,Y>) / support(X) = |<X,Y>| / |X|, where |X| is the number of sequences in the subset that contain the left-hand-side pattern. |X| may not reflect the popularity of X in the full dataset.
We define relative confidence as: relative_confidence(X => Y) := |<X,Y>| / |X|_fullset, where |X|_fullset is the number of sequences in the full dataset that contain the left-hand-side pattern.
Relative confidence favors patterns whose left-hand side contains less popular domains (in the example below, the last row has the highest relative confidence).
Domain | Pattern | Supp | Conf | Rel Conf
revenueindia.net | <{google.com}, {facebook.com}> => <{revenueindia.net}> | 0.079 | 0.75 | 0.0001
revenueindia.net | <{google.com}, {fileshare.com}> => <{revenueindia.net}> | 0.071 | 0.75 | 0.067
revenueindia.net | <{fileshare.com}, {redworm.com}> => <{revenueindia.net}> | 0.030 | 1.00 | 0.51
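The difference between the two measures can be shown with a small sketch. The function names are hypothetical, and sequences are simplified to flat lists of domains in time order.

```python
def is_subsequence(pattern, seq):
    """True if the domains in pattern appear in seq in the same order."""
    it = iter(seq)
    return all(domain in it for domain in pattern)


def confidences(full_sequences, lhs, target):
    """Return (confidence, relative_confidence) for the rule lhs => target.

    confidence uses only the sequences containing target (the SPM subset);
    relative_confidence divides by the LHS count in the full dataset.
    """
    subset = [s for s in full_sequences if target in s]
    rule = sum(is_subsequence(lhs + [target], s) for s in subset)
    lhs_subset = sum(is_subsequence(lhs, s) for s in subset)
    lhs_full = sum(is_subsequence(lhs, s) for s in full_sequences)
    confidence = rule / lhs_subset if lhs_subset else 0.0
    relative = rule / lhs_full if lhs_full else 0.0
    return confidence, relative
```

For a popular left-hand side such as google.com, lhs_full is far larger than lhs_subset, so the relative confidence drops sharply while the subset confidence stays high. For a rarely visited left-hand side the two values stay close.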
Model Update: Sequential Pattern Mining
The model update module for sequential pattern mining follows a workflow similar to that of the model execution module.
The one additional step is to merge the new results obtained from the incoming data with the existing set of patterns, including updating the rule quality metrics (support, confidence, etc.).
Create time-ordered domain sequences from new sessionized data
-> Given a list of targeted domains (e.g. rare domains), select the subset of sequences containing those domains
-> Find high-confidence, low-support sequential patterns of the targeted domains in parallel
-> Merge new results with the existing set of patterns
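One way to make the merge step straightforward is to store raw counts per pattern and recompute the ratios after merging. The sketch below assumes that design; the slides do not say how the metrics are stored, and the dictionary layout and function names are hypothetical.

```python
def merge_counts(existing, new):
    """Merge per-pattern counts from a new batch into the existing results.

    Both arguments map pattern -> {"rule": int, "lhs": int, "users": set}.
    Returns a new dictionary; the inputs are not modified.
    """
    merged = {
        p: {"rule": c["rule"], "lhs": c["lhs"], "users": set(c["users"])}
        for p, c in existing.items()
    }
    for p, c in new.items():
        entry = merged.setdefault(p, {"rule": 0, "lhs": 0, "users": set()})
        entry["rule"] += c["rule"]
        entry["lhs"] += c["lhs"]
        entry["users"] |= c["users"]
    return merged


def metrics(entry, total_sequences):
    """Recompute support, confidence and #users from merged counts."""
    return {
        "support": entry["rule"] / total_sequences,
        "confidence": entry["rule"] / entry["lhs"] if entry["lhs"] else 0.0,
        "num_users": len(entry["users"]),
    }
```

Counts add up across batches, whereas ratios such as confidence cannot simply be averaged, which is why the sketch keeps the numerators and denominators.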
Model Execution: Graph Mining
Construct “baskets” of domains (co-occurring domains) by running a sliding window of a certain time interval through the data
-> Find high-confidence, low-support pairwise association rules of the form Domain 1 => Domain 2
-> Create a social network of domains
-> Find partially and fully connected sub-graphs
Construction of “Baskets”
Domains visited by a user in a certain time window form a “basket”, analogous to items purchased in a single transaction as in market basket analysis.
The time interval for the sliding window (60s window used in the implementation) can be tuned.
A basket contains the distinct domains in a sliding window. Example below:
Basket 1 = {googlevideo.com, twitter.com, youtube.com, doubleclick.net, google.com}
Connection Time Domain
2015-01-06 14:41:00 googlevideo.com
2015-01-06 14:41:09 twitter.com
2015-01-06 14:41:12 youtube.com
2015-01-06 14:41:14 doubleclick.net
2015-01-06 14:42:00 google.com
2015-01-06 14:42:05 googleanalytics.com
2015-01-06 14:42:08 pivotal.io
2015-01-06 14:59:23 facebook.com
2015-01-06 14:59:24 yahoo.com
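Basket construction might be sketched as follows. Two details are assumptions chosen so that the sketch reproduces Basket 1 above: each connection opens its own window, and the 60-second boundary is inclusive. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def baskets(records, window_seconds=60):
    """Build one basket per connection from one user's time-sorted records.

    records: list of (connection_time, domain) pairs, sorted by time.
    Each basket holds the distinct domains seen within window_seconds
    (inclusive) of the connection that opens the window.
    """
    window = timedelta(seconds=window_seconds)
    result = []
    for i, (start, _) in enumerate(records):
        basket = set()
        for time, domain in records[i:]:
            if time - start > window:
                break
            basket.add(domain)
        result.append(basket)
    return result
```

For the table above, the window opened at 14:41:00 ends at 14:42:00, so it contains googlevideo.com through google.com and excludes googleanalytics.com.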
Pairwise Association Rule Mining
Given domain-to-basket assignments, pairwise association rule mining mainly involves evaluation of:
– Co-occurrence frequency: the number of times two domains fall in a common basket.
– Conditional probability: the probability of seeing domain 2 given that domain 1 is present.
Pairwise rule mining is implemented in plain SQL in a scalable fashion.
Domain A | Domain B | # {A,B} | # A | # B | P(A|B) | P(B|A) | # A to B | # B to A | # AB Same Time | Max(# User Names/M) | # Date | Min Date | Max Date
pivotal.io | montecarlo.com | 10 | 560 | 10 | 1.000000 | 0.017857 | 9 | 0 | 1 | 1 | 1 | 2015-02-26 | 2015-02-26
pivotal.io | bigbangtheory.com | 25 | 560 | 26 | 0.961538 | 0.044643 | 21 | 4 | 0 | 2 | 1 | 2015-02-23 | 2015-02-23
pivotal.io | sciencefiction.com | 78 | 560 | 97 | 0.804124 | 0.139286 | 61 | 15 | 2 | 4 | 8 | 2015-01-23 | 2015-02-17
High-confidence (>0.5) associations involving multiple users over several days (e.g. the last rule above, observed on 8 distinct dates) are generally more interesting.
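The two core quantities, co-occurrence counts and conditional probabilities, can be computed from baskets as in the sketch below. The project implements this in plain SQL; the Python function and its output field names are assumptions for illustration, and the directional and date columns of the table are not covered.

```python
from collections import Counter
from itertools import combinations


def pairwise_rules(baskets):
    """Compute pairwise co-occurrence statistics from a list of baskets (sets)."""
    domain_count = Counter()
    pair_count = Counter()
    for basket in baskets:
        domain_count.update(basket)
        # sorted() makes each unordered pair map to a single (a, b) key.
        pair_count.update(combinations(sorted(basket), 2))
    rules = []
    for (a, b), n_ab in sorted(pair_count.items()):
        rules.append({
            "a": a,
            "b": b,
            "n_ab": n_ab,
            "n_a": domain_count[a],
            "n_b": domain_count[b],
            "p_a_given_b": n_ab / domain_count[b],
            "p_b_given_a": n_ab / domain_count[a],
        })
    return rules
```

The same ratios reproduce the table: for pivotal.io and montecarlo.com, P(A|B) = 10/10 = 1.0 and P(B|A) = 10/560 ≈ 0.017857.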
Exploring Interactions between Domains
To explore the interactions between domains, we build an undirected correlation graph using the discovered pairwise domain association rules.
Each node in the graph is a domain. An edge connects two domains if their co-occurrence confidence is higher than a threshold (e.g. 0.2).
The example figure shows the tightly connected “social network” of a particular domain.
Partially and fully connected networks indicate possible watering hole or botnet attacks.
Question: How to quantify the connectivity of a network?
Figure: example domain graph with nodes abc.com, xyz.com, hga.com and hebf.com. Each node denotes a domain and each edge weight denotes the confidence (edge weights shown: 0.25, 0.37, 0.71, 0.52, 0.1, 0.6, 0.1).
OddBall Metrics for Graph Anomaly Detection
We take the OddBall approach* to quantify the connectivity of each domain’s network:
– Identify each domain’s one-step neighborhood (also called its ego-net).
– Extract two graph features from the ego-net:
▪ N: Number of neighbors
▪ E: Number of edges in the ego-net
The number of neighbors and the number of edges follow a power law: E ∝ N^α, 1 ≤ α ≤ 2.
• Use log(E)/log(N) to approximate the slope α. log(E)/log(N) > 1 indicates some degree of connectivity among neighbors.
• The higher the ratio, the higher the degree of connectivity (given the same number of neighbors). Generally, an OddBall ratio > 1.5 is more interesting.
• One can additionally compute the clique percentage to measure network connectivity: the ratio between E and the number of edges needed to form a clique, E/[(N^2+N)/2]. The ego-net has N+1 nodes including the domain itself, so a clique needs (N+1)N/2 edges.
* OddBall: Spotting Anomalies in Weighted Graphs, Leman Akoglu et al., PAKDD, Hyderabad, India, June 2010.
Picture Source: ICDM’12 tutorial on graph anomaly detection
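The ego-net features can be sketched as below, assuming the graph is built from (domain, domain, confidence) triples with the 0.2 threshold mentioned earlier. The function names are hypothetical, and only the two ratios from this slide are computed, not the full OddBall method.

```python
import math
from collections import defaultdict


def build_graph(rules, threshold=0.2):
    """Build undirected adjacency sets from (domain_a, domain_b, confidence) triples."""
    graph = defaultdict(set)
    for a, b, conf in rules:
        if conf > threshold and a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def egonet_metrics(graph, domain):
    """Return (N, E, log(E)/log(N), clique percentage) for a domain's ego-net."""
    neighbors = graph.get(domain, set())
    n = len(neighbors)
    nodes = neighbors | {domain}
    # Every edge inside the ego-net is seen from both endpoints, hence // 2.
    e = sum(len(graph.get(u, set()) & nodes) for u in nodes) // 2
    ratio = math.log(e) / math.log(n) if n > 1 and e > 0 else None
    clique_pct = e / ((n * n + n) / 2) if n > 0 else None
    return n, e, ratio, clique_pct
```

For a domain whose four neighbors are all connected to each other, this gives N = 4, E = 10, a ratio of about 1.66 and a clique percentage of 100%.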
Sample Domains with Highly Connected Networks
a.com has a fully connected network (clique percentage 100%), i.e. a clique.
Domain | # Neighbors | Neighbors | # Edge | log(E)/log(N) | Clique Percent | # User Names
a.com | 4 | {b.com, c.com, d.com, e.com} | 10 | 1.66 | 100% | 6
s.com | 7 | {a.com, b.com, c.com, d.com, e.com, f.com} | 27 | 1.69 | 96% | 9
r.com | 9 | {a.com, b.com, c.com, d.com, e.com, f.com, g.com, h.com, i.com} | 43 | 1.71 | 96% | 7
abc.ru | 9 | {a.com, b.com, c.com, d.com, e.com, f.com, g.com, h.com, i.com} | 42 | 1.70 | 93% | 11
Figure: the fully connected network of a.com with its neighbors b.com, c.com, d.com and e.com.
Detecting Isolated Clusters
Given the domain correlation graph, one can also identify isolated groups of domains that interact only with domains in the same group and not with others (a botnet-like structure).
This can be formulated as the task of finding connected components (CCs) in a graph.
The example below shows that malicious sites tend to exist in small CCs.
Figure: Sample connected component containing qre.com, jekc.com, fbc.com, abc.com, ghk.com and bcd.com, with one node marked as a known malicious site.
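Finding connected components is a standard graph traversal. The breadth-first sketch below is an illustration only: the function name is hypothetical and the slides do not say which algorithm was used. It takes adjacency sets and returns components sorted by size, so the small, isolated groups come first.

```python
from collections import deque


def connected_components(graph):
    """Return the connected components of an undirected graph, smallest first.

    graph: dict mapping each domain to the set of its neighboring domains.
    """
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len)
```

Small components that contain a known malicious site are natural candidates for review, since the other domains in the group interact only with it and with each other.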
Operationalization Vision
Run Algorithms
• Owned by Data Engineer/Data Scientist
• Incrementally (e.g. weekly) update models using new batches of data, e.g. as a cron job
Inspect Anomalies
• Owned by security team
• Ideally model outputs provided via interactive web dashboards
Evaluate Model Outputs
• Feedback on model performance from security team
• Opportunities for refinement and ideas for new models
Refine Algorithms
• Owned by Data Scientist
• Refine algorithms
Load New Data
• Owned by Data Engineer
• Load new data