Data Tactics Data Science Brown Bag (April 2014)
Uploaded by: richard-heimann
Category: Data & Analytics
![Page 1: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/1.jpg)
The Power of Partnership – from Vision to Reality
L-3 Data Tactics: Data Science Brown Bag
Welcome! Hard and Soft Clusters and Cyber Data
April 22, 2014
R2 = 500; p < .05, asymptotically approaching perfect
![Page 2: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/2.jpg)
• Why a (our 3rd) Data Science Brown Bag? (Rich H.)
• About US & About YOU (Rich H.)
• Case Studies in Cyber:
  • What are clustering, honeypots, and density-based clustering? (Max W.)
  • What is OPTICS clustering, how is it different from density-based (DB) clustering, and how can it be used for outlier detection? (David P.)
  • What is so-called soft clustering, how is it different from (hard) clustering, and how can it be used for outlier detection? (Nathan D.)
• On the horizon... (Rich H.)
DT Data Science Brown Bag: Outline
L-3
![Page 3: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/3.jpg)
DT Data Science Brown Bag: Outline
Learning [close] at a pace similar to the pace at which we learn.
Learning and educating from/to DS to PMs, SWEs, and OPs:
DS2PM: Provide insights for RFIs/RFPs. PM2DS: Atmospherics from our customers.
DS2SWE: Integrating algorithms. SWE2DS: Accessing data spaces.
DS2OP: How do you consume the outputs of models? OP2DS: What models are best to present to OPs?
DS: Data Scientists, PM: Program Managers, SWE: Software Engineers, OP: Operators
![Page 4: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/4.jpg)
The Team: (Geoffrey B., Nathan D., Rich H., David P., Ted P., Shrayes R., Jonathan T., Adam VE., Max W.)
Graduates from top universities, many of whom are EMC Data Science Certified.
Advanced degrees include: mathematics, computer science, astrophysics, electrical engineering, mechanical engineering, statistics, social sciences.
Base competencies (horizontals): clustering, association rules, regression, naive Bayesian classifier, decision trees, time-series, text analysis.
Going beyond the base (verticals)...
About Us: DT Data Science Team
![Page 5: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/5.jpg)
About Us: DT Data Science Team
Horizontals: Clustering || Regression || Decision Trees || Text Analysis || Association Rules || Naive Bayesian Classifier || Time Series Analysis
Verticals: econometrics; spatial econometrics; graph theory algorithms; astrophysical time-series analysis; path planning algorithms; bayesian statistics; constrained optimizations; numerical integration techniques; PCA; bagging/boosting; hierarchical models; IRT; space-time; latent class analysis; structural equation modeling; mixture models; SVM; maxent; CART; autoregressive models; ICA; factor analysis; random forest; dimensional reduction; topic models; sentiment analysis; frequency domain patterns; unsupervised by supervised; change-point models; LUBAPDLIS A DBAC; OPTICS clustering
![Page 6: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/6.jpg)
Hierarchy of Data Scientists
About Us: DT Data Science Team
![Page 7: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/7.jpg)
No Free Lunch (NFL) theorems: no algorithm performs better than any other when their performance is averaged uniformly over all possible problems of a particular type. Algorithms must be designed for a particular domain or style of problem, and there is no such thing as a general-purpose algorithm.
About Us: DT Data Science Team
![Page 8: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/8.jpg)
ABOUT YOU: 35 confirmed, 15 webex, 21 Data Tactics employees, 13 L-3 NSS employees; Sam Posten was the first to sign up (webex) and Aaron Glahe was the first to sign up for in-person!
library(twitteR)  # getUser(), lookupUsers(); assumes OAuth is already set up
library(igraph)   # graph.data.frame(), tkplot()
# define Twitter account names
l3 <- getUser("L3_NSS")
dt <- getUser("DataTactics")
# find all connections (accounts followed) independently for each account
l3.friends.object <- lookupUsers(l3$getFriendIDs())
dt.friends.object <- lookupUsers(dt$getFriendIDs())
# extract screen names from the returned user objects
l3.friends <- sapply(l3.friends.object, function(u) u$screenName)
dt.friends <- sapply(dt.friends.object, function(u) u$screenName)
# create one large table that relates each account to the users it follows
relations <- merge(data.frame(User = "DataTactics", Friend = dt.friends),
                   data.frame(User = "L3_NSS", Friend = l3.friends), all = TRUE)
# create network layout showing each account's community and overlap
g <- graph.data.frame(relations, directed = TRUE)
# finally plot the graph
tkplot(g)
![Page 9: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/9.jpg)
ABOUT YOU:
@DataTactics
@L3_NSS
![Page 10: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/10.jpg)
ABOUT YOU:
@L3_NSS
@DataTactics
![Page 11: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/11.jpg)
Why Clustering?
Six Pillars of Data Mining: Clustering has become a workhorse in Big Data and fits into the Six Pillars of Data Mining and our own DS4PM & DS4G framework.
• Anomaly detection: the identification of unusual data records that might be interesting, or data errors that require further investigation.
• Association rule learning: searching for relationships between variables.
• Clustering: the task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data.
• Classification: the task of generalizing known structure to apply to new data.
• Regression: finding a function that models the data with the least error.
• Summarization: providing a more compact representation of the data set.
Taxonomy of Questions (ref: DS4PM):
• Causal Effects: an approach to the statistical analysis of cause and effect based on the framework of potential outcomes.
• Classification/Clustering: identifying to which of a set of categories an observation belongs, on the basis of a training set of data (classification) or without labels (clustering).
• Outlier Detection: the identification of events or items that do not conform to an expected pattern or to the other items in a dataset.
• Big Data and Analytics: discovering interesting relations between variables in large databases.
• Measurement Models: statistical models that measure the relationships between the observable variables and the unobserved (or "latent") quantity.
• Text Analysis: the process of deriving high-quality information from text.
![Page 12: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/12.jpg)
Today’s presenters:
Max Watson: Max’s background is in physics and applied mathematics. Max completed his undergraduate degree at University of California, Berkeley and completed his PhD at University of California, Santa Barbara in 2012. Max specializes in large-scale simulations, signal analysis and statistical physics. He joined the Data Tactics team in January 2014 and has supported DHS. Max is an EMC Certified Data Scientist.
David Pekarek: David’s background is in Mechanical Engineering and he specializes in mechanical control systems, optimization, and spatio-temporal statistics. David finished his PhD in 2010 at California Institute of Technology, joined Data Tactics in the fall of 2012, and currently supports DARPA.
Nathan Danneman: Nathan’s background is in political science, with specializations in applied statistics and international conflict. He finished his PhD in June of 2013, and joined Data Tactics in May of that same year. He recently co-authored Social Media Mining with R, is active in the local Data Science community and currently supports DARPA. Nathan is an EMC Certified Data Scientist.
![Page 13: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/13.jpg)
Cluster Analysis of Honeypot Data
By Max Watson
![Page 14: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/14.jpg)
Outline
• What are Honeypots?
• Cluster Analysis
  - General Principles
  - Density Based Clustering
• Cluster Analysis Applied to Honeypot Data
• Conclusions
![Page 15: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/15.jpg)
Honeypots
Honeypots are traps set to detect, deflect, or counteract attempts at unauthorized use of information systems.
8 websites: (USA, 4), (Singapore, 2), (Brazil, 2) [brought to you by Ted Procita]
Collection Period: October 15, 2013 - November 18, 2013
2 Sources of Data: requests at firewall and requests at webserver
Number of webserver requests: ~4000
Some information from a typical ‘hit’ on the webserver:
IP address     Country   Request       Timestamp
101.227.4.25   CN        /robots.txt   10/17/13 17:58:21
![Page 16: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/16.jpg)
Goals of Honeypot Analysis
• Categorize IP addresses in terms of similar requests
• Determine how requests vary by country
• Detect outliers
![Page 17: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/17.jpg)
Cluster Analysis
Grouping similar objects:
Requirements:
1) Distance metric
2) Method for grouping nearby objects
![Page 18: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/18.jpg)
Distance I: Combine Requests
1) Gather all unique requests invoked by each IP address:
   IP1 ⇒ { /, /robots.txt, … }
   IP2 ⇒ { /HNAP1/, /manager/html, … }
   ...
![Page 19: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/19.jpg)
Distance II: Jaccard Similarity
Requests from IP address A: {♣, ♦}
Requests from IP address B: {♥, ♦}
Jaccard Similarity: J(A, B) = |intersection(A, B)| / |union(A, B)|
intersection(A, B) = {♦}
union(A, B) = {♣, ♦, ♥}
J(A, B) = 1/3
Effective Distance: D = 1 - J, ranging from 0 to 1
D = 0: A and B issue the same requests
D = 1: A and B issue completely different requests
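A minimal sketch of the effective distance D = 1 - J for two request sets, in base R. The function name `jaccard_distance` and the example request sets are illustrative assumptions, not taken from the original analysis.

```r
# Jaccard distance between two sets of requests:
# D = 1 - (size of intersection) / (size of union).
# Function name and example data are hypothetical, for illustration only.
jaccard_distance <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  n_union <- length(union(a, b))
  if (n_union == 0) return(0)  # convention assumed here: two empty sets are identical
  1 - length(intersect(a, b)) / n_union
}

ip_a <- c("/", "/robots.txt")
ip_b <- c("/manager/html", "/robots.txt")
d_ab <- jaccard_distance(ip_a, ip_b)  # one shared request out of three distinct ones
```

With one shared request out of three distinct ones, J = 1/3 and D = 2/3, mirroring the card-suit example on the slide.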
![Page 20: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/20.jpg)
Distance III: All Pairs
Calculate effective distance between all pairs of IP addresses
...leaving us with:
But usually in a high number of dimensions!
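The all-pairs step can be sketched as a symmetric distance matrix, one row and column per IP address. This is a base R illustration only: `pairwise_jaccard`, the list layout, and the third entry `IP3` are assumptions made for the example.

```r
# Pairwise Jaccard distances between the request sets of several IP addresses.
# `requests_by_ip` is a hypothetical named list: one character vector per IP.
pairwise_jaccard <- function(requests_by_ip) {
  n <- length(requests_by_ip)
  ids <- names(requests_by_ip)
  d <- matrix(0, n, n, dimnames = list(ids, ids))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      a <- unique(requests_by_ip[[i]])
      b <- unique(requests_by_ip[[j]])
      n_union <- length(union(a, b))
      if (n_union > 0) d[i, j] <- 1 - length(intersect(a, b)) / n_union
    }
  }
  d
}

requests_by_ip <- list(
  IP1 = c("/", "/robots.txt"),
  IP2 = c("/HNAP1/", "/manager/html"),
  IP3 = c("/", "/robots.txt", "/manager/html")  # made-up third address
)
dist_matrix <- pairwise_jaccard(requests_by_ip)
```

The resulting matrix can be handed to any clustering routine that accepts precomputed distances, for example via `as.dist(dist_matrix)`.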
![Page 21: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/21.jpg)
Identifying Clusters
How many clusters are there?
Are there outliers?
Density Based Clustering:
● connectivity
● density
![Page 22: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/22.jpg)
Connectivity
[Figure: points linked when closer than a Distance Threshold, forming Cluster 1 and Cluster 2; companion plot of Number of Clusters (1 to 3) vs. Distance Threshold]
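One way to reproduce the "number of clusters vs. distance threshold" idea is single-linkage hierarchical clustering, which groups points exactly when they are connected by a chain of links no longer than the cut height. The sketch below uses base R's `hclust` and `cutree`; the 1-D toy data and threshold values are assumptions for illustration.

```r
# Connectivity-based grouping: two points are linked when their distance is at
# most the threshold, and each connected group of points forms a cluster.
# Single-linkage hierarchical clustering cut at height h gives the same groups.
# The 1-D toy data and the threshold values are illustrative assumptions.
x <- c(0, 1, 2, 10, 11, 12)
tree <- hclust(dist(x), method = "single")

thresholds <- c(0.5, 1, 8)
n_clusters <- sapply(thresholds, function(h) length(unique(cutree(tree, h = h))))
labels_at_1 <- cutree(tree, h = 1)  # cluster label per point at threshold 1
```

As the threshold grows, the number of clusters can only stay the same or shrink, which is the staircase shape of the plot described on the slide.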
![Page 23: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/23.jpg)
Density Based Clustering
2 parameters: distance threshold and minimum number of neighbors (DBSCAN)
Example: minimum number = 2
[Figure: points labeled as Clusters and Outliers]
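The two DBSCAN parameters can be seen at work in a small base R sketch that runs on a precomputed distance matrix. This is not the implementation used for the honeypot analysis; `dbscan_simple` and the toy data are assumptions, and `min_pts` here counts the point itself, so "minimum number of neighbors = 2" corresponds to `min_pts = 3`.

```r
# Minimal DBSCAN on a precomputed distance matrix D (base R only).
# eps is the distance threshold; min_pts is the minimum neighborhood size,
# counted here INCLUDING the point itself (a convention assumed for this sketch).
# Returns one label per point: 0 = outlier, 1, 2, ... = cluster id.
dbscan_simple <- function(D, eps, min_pts) {
  n <- nrow(D)
  labels <- rep(NA_integer_, n)   # NA = not yet visited
  cluster <- 0L
  for (i in seq_len(n)) {
    if (!is.na(labels[i])) next
    neighbors <- which(D[i, ] <= eps)
    if (length(neighbors) < min_pts) {
      labels[i] <- 0L             # provisionally an outlier
      next
    }
    cluster <- cluster + 1L
    labels[i] <- cluster
    queue <- setdiff(neighbors, i)
    while (length(queue) > 0) {
      j <- queue[1]
      queue <- queue[-1]
      if (!is.na(labels[j]) && labels[j] > 0L) next  # already in a cluster
      was_unvisited <- is.na(labels[j])
      labels[j] <- cluster                           # border or core point joins
      if (!was_unvisited) next                       # former outlier: border point only
      j_neighbors <- which(D[j, ] <= eps)
      if (length(j_neighbors) >= min_pts) {
        queue <- union(queue, j_neighbors)           # core point: keep expanding
      }
    }
  }
  labels
}

x <- c(0, 1, 2, 10, 11, 12, 50)       # made-up 1-D data
labels <- dbscan_simple(as.matrix(dist(x)), eps = 1.5, min_pts = 3)
```

In this toy run the two tight groups become clusters 1 and 2, and the isolated point at 50 is labeled 0 (outlier).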
![Page 24: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/24.jpg)
Shiny App for Analysis
![Page 25: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/25.jpg)
China
Dominant Requests of Each Cluster
❶ /robots.txt
❷ /
❸ /manager/html
❹ www.baidu.com/
![Page 26: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/26.jpg)
China
[Figure: Hits Over Time; x-axis: Time (~34 Days); y-axis: Number of Hits; series: /manager/html vs. Other]
![Page 27: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/27.jpg)
United States
10 Clusters (Malicious & Benign)
![Page 28: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/28.jpg)
United States
[Figure: Hits Over Time; x-axis: Time (~34 Days); y-axis: Number of Hits; series: Clusters vs. Outliers; annotation: from same IP address]
![Page 29: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/29.jpg)
Accomplishments
✓ Categorized behavior of IP addresses based on requests
✓ Detection of outliers
✓ Determined how requests vary by country (China vs. USA)
![Page 30: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/30.jpg)
What Clustering Can Do for You
Objects + Attributes
Cluster the Objects | Cluster the Attributes
Applications:
• IP addresses & their requests
• patients & their symptoms
• devices & their malfunctions
• people & their associates
![Page 31: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/31.jpg)
Port Based Clustering of Firewall Activity
By David Pekarek
![Page 32: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/32.jpg)
Firewall Activity Clustering Workflow
Honey Pot Firewall Activity: ~32K logs (time, host, src IP, location, ports, protocol)
→ Data Preprocessing and Vectorization: aggregated counts of IPs' dest. port hits (~19K IPs × 128 ports)
→ OPTICS Clustering: Reachability Distance plot identifying user clusters and outliers
→ Follow-on Investigations: characteristics of outlying IPs and IP clusters (e.g., distinct activity levels on port 53)
• The majority of source IPs make use of only one destination port
• 94% of source IPs fall into some cluster with similar port usage and traffic volume
![Page 33: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/33.jpg)
OPTICS: Hierarchical Density Based Clustering
• Clustering algorithms provide a means to sort data without pre-existing labels
• Density-based clustering methods are robust in identifying clusters with non-uniform shapes
• The OPTICS algorithm is a density-based approach that simultaneously evaluates cluster results at different scales
[Figure: k-Means results vs. density-based clustering results]
Is this one cluster or two? The answer depends on scale!
![Page 34: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/34.jpg)
OPTICS: How does it work?
• The OPTICS algorithm performs two major operations on the data:
  • determining an ordering of all data points, based on the likelihood of points being clustered together
  • assigning each point a Reachability Distance (R.D.): a quantification of the length scale at which the given point will belong to any cluster
• Plotting R.D. vs the ordered data points, clusters appear as troughs
[Figure: reachability plot with regions labeled Whole face, Eye, Eye, Smile, and Outliers]
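The two operations above (ordering the points and assigning reachability distances) can be sketched in base R as follows. This is a simplified illustration on a precomputed distance matrix, not the implementation used in the firewall study; `optics_simple`, the toy data, and the convention that `min_pts` counts the point itself are all assumptions.

```r
# Simplified OPTICS on a precomputed distance matrix D (base R only).
# eps bounds the neighborhood search; min_pts counts the point itself
# (both conventions are assumptions of this sketch).
optics_simple <- function(D, eps, min_pts) {
  n <- nrow(D)
  reach <- rep(Inf, n)          # reachability distance of each point
  processed <- rep(FALSE, n)
  ordering <- integer(0)
  # core distance: distance to the min_pts-th nearest point within eps
  core_distance <- function(i) {
    d <- sort(D[i, D[i, ] <= eps])
    if (length(d) >= min_pts) d[[min_pts]] else Inf
  }
  for (start in seq_len(n)) {
    if (processed[start]) next
    p <- start
    repeat {
      processed[p] <- TRUE
      ordering <- c(ordering, p)
      cd <- core_distance(p)
      if (is.finite(cd)) {
        # lower the reachability of unprocessed neighbors where possible
        nb <- which(D[p, ] <= eps & !processed)
        reach[nb] <- pmin(reach[nb], pmax(cd, D[p, nb]))
      }
      # unprocessed points with finite reachability act as the seed queue
      seeds <- which(!processed & is.finite(reach))
      if (length(seeds) == 0) break
      p <- seeds[which.min(reach[seeds])]
    }
  }
  list(order = ordering, reachability = reach[ordering])
}

x <- c(0, 1, 2, 10, 11, 12, 50)   # made-up 1-D data
res <- optics_simple(as.matrix(dist(x)), eps = 5, min_pts = 2)
```

Plotting `res$reachability` in the order `res$order` gives the reachability plot: runs of small values are the troughs (clusters), while a point such as the last one, with infinite reachability and no trough after it, shows up as an outlier.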
![Page 35: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/35.jpg)
OPTICS: How was it applied?
Data Preprocessing and Vectorization → OPTICS Clustering → Follow-on Investigations
• Source IPs were used as the identifier for entities with traffic hitting the honey pot firewall.
• Destination ports were used to define the dimensions of feature space. Each of the 127 most common ports (those with at least 60 hits from the total population) got its own dimension. The remaining ‘rare’ ports were bundled as a single dimension.
• The OPTICS algorithm identified clusters of IPs in 128-dimensional space, with clustering results summarized in a 2-D reachability plot.
• Follow-on investigations were performed to identify anomalous properties of outlying IPs and commonalities among clustered IPs.
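The vectorization step described above can be sketched in base R: count hits per source IP and destination port, giving each common port its own column and bundling the rest into one. The function name, argument names, and toy log records are assumptions; only the "at least 60 hits" rule comes from the slide (the example lowers it to 2 so the toy data exercises both cases).

```r
# Build a source-IP x destination-port count matrix from firewall log records.
# Ports with at least `min_hits` total hits get their own column; all other
# ports are bundled into a single "rare" column. Names are hypothetical.
vectorize_ports <- function(src_ip, dest_port, min_hits = 60) {
  port <- as.character(dest_port)
  port_totals <- table(port)
  common <- names(port_totals)[port_totals >= min_hits]
  port_label <- ifelse(port %in% common, port, "rare")
  unclass(table(src_ip, port_label))   # plain integer matrix with dimnames
}

# Made-up log records: five hits from three source IPs.
src_ip    <- c("a", "a", "b", "b", "c")
dest_port <- c(80, 80, 80, 22, 23)
m <- vectorize_ports(src_ip, dest_port, min_hits = 2)
```

Each row of `m` is then one point in feature space, ready for a distance computation and a clustering step.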
![Page 36: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/36.jpg)
Firewall Port Usage Clustering Results
![Page 37: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/37.jpg)
Firewall Port Usage Clustering Results
Clusters with some distinctive activity
Outlying IPs (their activity falls into clusters only at extremely generous length scales)
![Page 38: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/38.jpg)
Interactive Plotting Demo
Interactive Plotting Demo Here
![Page 39: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/39.jpg)
Port Usage Cluster Characterization
IPs with minimal activity on highly travelled ports (22, 53, Other)
Outlying IPs: activity on multiple ports or very seldom used ports
1-15 hits on port 80 (HTTP)
1-10 hits on port 3389 (RDP)
1-14 hits on port 1433 (MSSQL)
1-6 hits on port 445 (SMB/Microsoft-DS, used by Active Directory)
Small clusters with activity on less used ports (3306, 5060, 4899, 135, 25, 23, 45091, 48879, 1234)
![Page 40: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/40.jpg)
Port 53 Traffic Clustering Validation
OPTICS identifies the multimodal distribution of traffic to port 53 (DNS)
![Page 41: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/41.jpg)
Conclusions
• Destination ports show little correlation in the firewall logs. Source IPs tend to cluster by the one port to which they sent traffic.
• OPTICS clustering efficiently sorts source IPs as outliers or belonging to a cluster of common port usage.
• Interactive plotting tools allow for the rapid characterization of clusters.
![Page 42: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/42.jpg)
Latent Dirichlet Allocation: Characterizing normal behavior and identifying deviations from normality
By Nathan Danneman
![Page 43: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/43.jpg)
Outline
• What is Latent Dirichlet Allocation (LDA)?
• How does it compare to other clustering tools?
• LDA by example: analyzing log files
![Page 44: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/44.jpg)
LDA is a Mixture Model
• Mixture Models:
  – Identify sets of variables that co-occur (behavioral patterns)
  – Determine what behavioral patterns each individual exhibits
• Example: The Sports Equipment Analogy
Golf Clubs Tennis Racket Golf Balls Tennis Balls Baseball Bat
John 12 4
Susan 14 1 6 3
Chris 2 3
Jane 1 11 1
![Page 45: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/45.jpg)
![Page 46: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/46.jpg)
LDA Utilizes Soft Clustering
• Hard Clustering: every point is assigned to one group
• Hard Clustering with Outliers: every point is assigned to one or no groups
• Soft Clustering: every point is assigned to zero, one, or several groups.
[Figure: points plotted on axes x1 and x2]
![Page 47: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/47.jpg)
![Page 48: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/48.jpg)
Input Data for LDA: Cyber Data
• LDA takes a matrix of counts
• Data: log files from a large network; 8700 users, 85 log types
• Each row represents a user
• Each column represents a log type
         Connection Success   Termination Success   Invalid Login   ...
User 1   0                    3                     2
User 2   12                   3                     0
User 3   3                    0                     18
User 4   2                    22                    1
User 5   7                    5                     9
...      ...
![Page 49: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/49.jpg)
LDA Estimates Two Mixtures
• Output 1: logs that co-occur, forming behavioral patterns
[Diagram: Log Type 1, Log Type 2, Log Type 3, Log Type 4, ... linked to Behavioral Pattern 1 and Behavioral Pattern 2]
Each log relates to zero, one, or many behavioral patterns
![Page 50: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/50.jpg)
LDA Estimates Two Mixtures
• Output 1: logs that co-occur, forming behavioral patterns
• Output 2: which behavioral pattern(s) characterize each user
[Diagram: Log Type 1, Log Type 2, Log Type 3, Log Type 4, ... linked to Behavioral Pattern 1 and Behavioral Pattern 2, which are in turn linked to User 1, User 2, User 3, User 4, ...]
Users exhibit zero, one, or many behaviors
![Page 51: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/51.jpg)
LDA Workflow
• Build the N (observation) by P (log types) matrix of counts
• Use an empirical method to determine the optimal number of behavioral patterns to estimate
• Estimate the model
        Connection (Successful)   Connection (Failure)   Termination (Successful)   Connection (Time-Out)
User1   15                        15                     0                          3
User2   8                         12                     2                          0
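To make the "estimate the model" step concrete, here is a compact base R sketch of one common way to fit LDA, collapsed Gibbs sampling. The deck does not say which estimator or package was used, so the function `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the fixed choice of `k = 2` are all assumptions; the first two rows of the toy matrix come from the slide and the last two are made up.

```r
# Collapsed Gibbs sampler for LDA on a users x log-types count matrix.
# Returns theta (users x patterns) and phi (patterns x log types).
# An illustrative sketch only; alpha, beta, and iters are assumed values.
lda_gibbs <- function(counts, k, iters = 200, alpha = 0.1, beta = 0.1, seed = 1) {
  set.seed(seed)
  n <- nrow(counts)
  p <- ncol(counts)
  # expand the count matrix into one token per logged event
  cells <- which(counts > 0, arr.ind = TRUE)
  reps <- counts[cells]
  user <- rep(cells[, 1], times = reps)
  type <- rep(cells[, 2], times = reps)
  z <- sample.int(k, length(user), replace = TRUE)  # random initial patterns
  n_uk <- matrix(0, n, k)   # tokens of user u assigned to pattern j
  n_kt <- matrix(0, k, p)   # tokens of log type t assigned to pattern j
  n_k <- numeric(k)         # total tokens assigned to pattern j
  for (i in seq_along(z)) {
    n_uk[user[i], z[i]] <- n_uk[user[i], z[i]] + 1
    n_kt[z[i], type[i]] <- n_kt[z[i], type[i]] + 1
    n_k[z[i]] <- n_k[z[i]] + 1
  }
  for (it in seq_len(iters)) {
    for (i in seq_along(z)) {
      u <- user[i]; w <- type[i]; old <- z[i]
      # remove the token, resample its pattern, then add it back
      n_uk[u, old] <- n_uk[u, old] - 1
      n_kt[old, w] <- n_kt[old, w] - 1
      n_k[old] <- n_k[old] - 1
      prob <- (n_uk[u, ] + alpha) * (n_kt[, w] + beta) / (n_k + p * beta)
      new <- sample.int(k, 1, prob = prob)
      z[i] <- new
      n_uk[u, new] <- n_uk[u, new] + 1
      n_kt[new, w] <- n_kt[new, w] + 1
      n_k[new] <- n_k[new] + 1
    }
  }
  list(theta = (n_uk + alpha) / (rowSums(n_uk) + k * alpha),
       phi = (n_kt + beta) / (n_k + p * beta))
}

counts <- rbind(c(15, 15, 0, 3),   # User1 (from the slide)
                c(8, 12, 2, 0),    # User2 (from the slide)
                c(0, 1, 9, 7),     # made-up user
                c(1, 0, 12, 6))    # made-up user
fit <- lda_gibbs(counts, k = 2, iters = 50)
```

Each row of `fit$phi` is one behavioral pattern (a distribution over log types) and each row of `fit$theta` is one user's mixture over patterns, matching Output 1 and Output 2 in the deck. Choosing `k` empirically, as the workflow recommends, would mean refitting for several values and comparing fit, which this sketch does not attempt.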
![Page 52: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/52.jpg)
Output 1: Mapping Log Types to Behavioral Patterns: Behavioral Pattern #3
Firewall.Connections.Successful
Firewall.Connections.Terminations
![Page 53: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/53.jpg)
Output 1: Mapping Log Types to Behavioral Patterns: Behavioral Pattern #4
Firewall.Connections.Successful
Firewall.Connections.Terminations
Windows.Hosts.User.Logins
Windows.Hosts.User.Logoffs
Windows.Hosts.User.Privileged.Use.Successful
![Page 54: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/54.jpg)
Behavioral Pattern Characterization
• Behavioral Pattern 1: Windows Hosts: Failed Logins
• Behavioral Pattern 2: Firewall: Connections; Windows Hosts: Logins, Logoffs
• Behavioral Pattern 3: Firewall: Connections, Terminations
• Behavioral Pattern 4: Windows Hosts: Logins
• Behavioral Pattern 5: Firewall: System Normal, Connections, Terminations
• Behavioral Pattern 6: Web Logs: System Normal
• Behavioral Pattern 7: Firewall: System Errors
[Slide annotations grouping the patterns: Normal Activity; Abnormal Activity; Abnormal Activity]
![Page 55: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/55.jpg)
Behavioral Pattern Characterization
[Figure axis labels: Windows Hosts: Failed Logins; Firewall: Connections, Terminations; Windows Hosts: Logins; Firewall: System Normal, Connections, Terminations; Web Logs: System Normal; Firewall: System Errors; Firewall: Connections; Windows Hosts: Logins, Logoffs]
![Page 56: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/56.jpg)
![Page 57: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/57.jpg)
Characterizing Users with Behavioral Patterns
User # 2
Essentially, entirely firewall connections and terminations.
![Page 58: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/58.jpg)
Characterizing Users with Behavioral Patterns
User # 43
Lots of failed logins!
Normal activity: connections, terminations, logins, logoffs
![Page 59: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/59.jpg)
Visualizing Two-Level Mixtures
Behavioral Pattern 3
Behavioral Pattern 4
A User Characterized by: 45% Behavior 3 and 55% Behavior 4
![Page 60: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/60.jpg)
Outlier Detection with LDA
• Mixture models make predictions about the proportion of each log type a user will have
• We can compare the predicted proportions to each user’s actual proportions to see how well the model captures each user’s actions
• Typical users should be well-characterized by mixtures of common behavioral patterns – these are “normal” users
• Users whose actions are not mixtures of common behavioral patterns are doing things that are uncommon – these are outliers
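The comparison of predicted and observed proportions can be sketched in base R as below. It assumes the two estimated mixtures are available as matrices `theta` (users by patterns) and `phi` (patterns by log types), that every user has at least one log, and that a similarity cutoff of 0.5 marks a poor fit; these names and the cutoff are illustrative assumptions, and the toy numbers are made up.

```r
# Compare each user's observed log-type proportions with the proportions a
# fitted mixture model predicts, and flag users the model explains poorly.
# theta (users x patterns) and phi (patterns x log types) are assumed to be
# the two estimated mixtures; the 0.5 cutoff is an arbitrary illustrative value.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

flag_poorly_fit <- function(counts, theta, phi, cutoff = 0.5) {
  observed <- counts / rowSums(counts)   # each row sums to 1
  predicted <- theta %*% phi             # model-implied proportions per user
  sims <- sapply(seq_len(nrow(counts)), function(i) {
    cosine_similarity(observed[i, ], predicted[i, ])
  })
  data.frame(user = seq_len(nrow(counts)), similarity = sims, outlier = sims < cutoff)
}

# Toy example with 2 behavioral patterns and 3 log types (made-up numbers).
phi <- rbind(c(0.5, 0.5, 0.0),
             c(0.0, 0.5, 0.5))
theta <- rbind(c(1, 0), c(0, 1), c(1, 0))
counts <- rbind(c(10, 10, 0),   # matches pattern 1 exactly
                c(0, 5, 5),     # matches pattern 2 exactly
                c(0, 0, 20))    # assigned pattern 1 but looks nothing like it
result <- flag_poorly_fit(counts, theta, phi)
```

Users 1 and 2 are reproduced by the model (similarity near 1), while user 3's logs share nothing with the predicted mix (similarity 0) and are flagged.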
![Page 61: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/61.jpg)
Measuring User-Level Discrepancy
Cosine Similarity = 0.99
Proportions of All Log Types for a Single User
![Page 62: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/62.jpg)
Measuring User-Level Discrepancy
Cosine Similarity = 0.02
Proportions of All Log Types for a Single User
![Page 63: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/63.jpg)
Cosine Similarity between Predicted and Observed Data (all users)
~99% of users are well-explained
![Page 64: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/64.jpg)
Cosine Similarity between Predicted and Observed Data (poorly fit users)
![Page 65: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/65.jpg)
LDA Detects Univariate Outliers
One user had 77% Windows Hosts Failed Logins; mean for data is 0.002%
Proportion of Windows Hosts: Failed Login Logs
User # 12
![Page 66: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/66.jpg)
LDA Detects Conditional Outliers
User # 53 has a typical proportion of Firewall Termination logs...
[Figure: histogram; x-axis: Percentage of Logs that are Firewall Terminations; y-axis: Number of Users; User 53 marked; annotation: Firewall Terminations comprise about 50% of many users’ logs]
However, User 53 has more than twice as many Firewall Terminations as users with the same proportion of Firewall Connections.
[Figure: histogram; x-axis: Percentage of Logs that are Firewall Terminations; y-axis: Number of Users; Firewall Terminations among users with 53’s proportion of Firewall Connections; User 53 marked]
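The idea of a conditional outlier can be illustrated without LDA: compare a user's value to the values of users who look similar on another variable. The base R sketch below is only an analogy to what the model captures; the function name, the peer window of 0.05, and the data are assumptions, and the ratio of 2 echoes the "more than twice as many" wording on the slide.

```r
# Conditional outlier check: is a user's value of y unusual among users whose
# x is similar, even if y looks typical overall? Here x and y stand for the
# proportions of two log types. Window width and ratio are assumed values.
conditional_outliers <- function(x, y, window = 0.05, ratio = 2) {
  sapply(seq_along(x), function(i) {
    peers <- setdiff(which(abs(x - x[i]) <= window), i)
    if (length(peers) == 0) return(FALSE)
    y[i] > ratio * median(y[peers])
  })
}

# Made-up data: user 5 has a common y overall but a high y for its x.
x <- c(0.50, 0.50, 0.52, 0.10, 0.11, 0.12, 0.10)
y <- c(0.50, 0.48, 0.50, 0.20, 0.50, 0.22, 0.20)
flags <- conditional_outliers(x, y)
```

User 5's y value of 0.50 is shared by several other users, so it is not a univariate outlier, but it is more than twice the median y among users with a similar x, so it is flagged.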
![Page 67: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/67.jpg)
Conclusions
• LDA allows an analyst to:
  – Succinctly characterize common behavioral patterns
  – Capture nuance through soft clustering
  – Identify both simple and conditional outliers
• Next Steps:
  – Radically improve parallelized versions of LDA
  – Build enhanced visualizations that allow analysts to interact with data
• Previous Steps:
  – Cyber IR&D II - Honeypots & Topic Graphs
  – https://portal.data-tactics-corp.com/sites/analytics/Shared%20Documents/honeypots.pdf
![Page 68: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/68.jpg)
• Query-based analytics are tenuous for data with large feature spaces and population sizes. For complete answers, we must analyze with comprehensive algorithms.
• Cyber systems regularly lack reliable (or stationary) models and priors. Hence we have been focused on questions of pattern detection (hard) and outlier detection (harder) for big cyber data, primarily obtaining results via clustering analyses.
• There are many, many clustering algorithms, each with distinct features and requirements (No Free Lunch theorems). Choosing the most appropriate tool requires a deep understanding of the available data, the questions at hand, and the pros and cons of applicable methods.
Final Thoughts…
![Page 69: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/69.jpg)
• L-3 Data Tactics has several minimum viable products (MVPs) working on very hard elements of the cyber analytics problem set.
• These MVPs can be used in a support function to existing security-protocol and signature-based systems, or provide those systems already in place with pattern and anomaly detection.
• Previous and future honeypot collection will further define L-3’s cyber competencies in proactive cyber analytics.
Final Thoughts…
![Page 70: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/70.jpg)
...on the Horizon:
Honeypots and Twitter Collection Platforms
Summer Data Science Internship Program (Robert R. & USMA cadets): Honeypots analytical application development
USA Civil Affairs & CERDEC Analytics http://glimmer.rstudio.com/gosystems01/Stability/
Next Data Science Brown Bag late July.
DS4G & DS4PM both making appearances this year.
Data Science on display at the L-3 Technology Exchange 2014… more to come.
![Page 73: Data Tactics Data Science Brown Bag (April 2014)](https://reader034.fdocuments.net/reader034/viewer/2022042814/54c6a4e04a7959e4208b458a/html5/thumbnails/73.jpg)
Questions?
Homepage: http://www.data-tactics.com
Blog: http://datatactics.blogspot.com
Twitter: https://twitter.com/DataTactics
Twitter: https://twitter.com/rheimann
Twitter: https://twitter.com/mwatson
Twitter: https://twitter.com/ndanneman
Or, me (Rich Heimann) at [email protected]