Cloudy with a Chance of Breach: Forecasting Cyber Security Incidents
Yang Liu§, Armin Sarabi§, Jing Zhang§, Parinaz Naghizadeh§
Manish Karir], Michael Bailey∗, Mingyan Liu§,]
§ EECS Department, University of Michigan, Ann Arbor
] QuadMetrics, Inc.
∗ ECE Department, University of Illinois, Urbana-Champaign
http://grs.eecs.umich.edu
Motivation
Increasingly frequent and high-impact data breaches
- Target, JP Morgan Chase, Home Depot, to name a few
- Increasing social and economic impact of such cyber incidents
Limitations of current approaches
- Heavily detection based
- Fail to detect, or too late by the time a breach is detected
- Not suited for cost/damage control
- Urgent need for more proactive measures
Detection
- Analogous to diagnosing a patient who may already be ill (e.g., by using a biopsy).
- [Qian et al., NDSS '14; Wang et al., USENIX Sec '14]
Prediction
- Predicting whether a presently healthy person may become ill based on a variety of relevant factors.
- [Soska & Christin, USENIX Sec '14]
Our goal:
- Understand the extent to which one can forecast incidents on an organizational level.
Objective
To develop the ability to forecast security incidents
- Applicability: we rely solely on externally observed data; we do not require information on the internal workings of a network or its hosts.
- Robustness: we do not have control over, or direct knowledge of, the errors embedded in the data.
Key idea:
- Tap into a diverse set of data that captures different aspects of a network's security posture, ranging from the explicit to the latent.
Why prediction?
Forecasting enables entirely new classes of applications that are otherwise not feasible.
- Prediction allows proactive policies and measures to be adopted, rather than reactive measures following detection.
Forecasting enables effective risk management schemes.
- Internal to an organization: more informed decisions on resource allocation.
- External to an organization: incentive mechanisms such as cyber insurance.
Outline of the talk
- Data and preliminaries
  - Description of the data
  - Data pre-processing
- Forecasting methods
  - Construction of the predictor
- Forecasting results
  - Main prediction results and analysis
Datasets at a glance
Category                 Collection period   Datasets
Mismanagement symptoms   Feb '13 - Jul '13   Open Recursive Resolvers, DNS Source Port, BGP misconfiguration, Untrusted HTTPS, Open SMTP Mail Relays
Malicious activities     May '13 - Dec '14   CBL, SBL, SpamCop, UCEPROTECT, WPBL, SURBL, PhishTank, hpHosts, Darknet scanners list, DShield, OpenBL
Incident reports         Aug '13 - Dec '14   VERIS Community Database, Hackmageddon, Web Hacking Incidents
- Mismanagement and malicious activities used to extract features.
- Incident reports used to generate labels for training and testing.
Security posture data
Mismanagement symptoms
- Deviations from known best practices; indicators of a lack of policy or expertise:
  - Misconfigured HTTPS certificates, DNS (resolver + source port), mail servers, BGP.
- Collected around mid-2013 (pre-incidents).
Malicious activity data: a set of 11 reputation blacklists (RBLs)
- Daily collections of IPs seen engaged in some malicious activity.
- Three malicious activity types: spam, phishing, scanning.
- We use data between May 2013 and December 2014.
Security incident data
Three incident datasets
- Hackmageddon
- Web Hacking Incidents Database (WHID)
- VERIS Community Database (VCDB)

Incident type   SQLi   Hijacking   Defacement   DDoS
Hackmageddon    38     9           97           59
WHID            12     5           16           45

Incident type   Crimeware   Cyber Esp.   Web app.   Else
VCDB            59          16           368        213
Data pre-processing
Incident cleaning
- Remove irrelevant cases, e.g., "robbery at a liquor store", "something happened", etc.
Data diversity presents a challenge in alignment in time and space.
- Security posture data record information at the host (IP address) level.
- Cyber incident reports are associated with an organization.
- Such alignment is not trivial: address reallocation makes organizational boundaries unclear.
A mapping process (a sketch follows this list):
- Summarize owner IDs from RIR databases.
- 4.4 million prefixes listed under 2.6 million owner IDs: a finer granularity than the routing table.
- Sample IPs from the organization and search in the above table.
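The following is a minimal sketch, under my own assumptions, of what such a prefix-to-owner lookup could look like; the table contents, the helper names (`owner_table`, `lookup_owner`, `map_organization`), and the majority-vote step are illustrative, not the authors' actual pipeline.

```python
# Hypothetical sketch: map sampled organization IPs to owner IDs via
# longest-prefix match over an RIR-derived prefix-to-owner table.
import ipaddress
from collections import Counter

# Toy stand-in for the 4.4M-prefix table summarized from RIR databases.
owner_table = {"141.212.0.0/16": "ORG-A", "141.0.0.0/8": "ORG-B"}

# Sort prefixes most-specific first so the first match is the longest prefix.
networks = sorted(
    ((ipaddress.ip_network(p), owner) for p, owner in owner_table.items()),
    key=lambda item: item[0].prefixlen,
    reverse=True,
)

def lookup_owner(ip_str):
    """Return the owner ID of the most specific prefix covering ip_str, if any."""
    ip = ipaddress.ip_address(ip_str)
    for net, owner in networks:
        if ip in net:
            return owner
    return None

def map_organization(sampled_ips):
    """Assign the organization to the owner ID covering most of its sampled IPs."""
    votes = Counter(lookup_owner(ip) for ip in sampled_ips)
    votes.pop(None, None)  # ignore IPs with no covering prefix
    return votes.most_common(1)[0][0] if votes else None

print(map_organization(["141.212.1.5", "141.212.9.9", "141.3.4.5"]))  # -> ORG-A
```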
Approach at a glance
Feature extraction
- 258 features extracted from the datasets: primary + secondary features.
Label generation
- 1,000+ incident reports from the three incident datasets.
Classifier training and testing
- A Random Forest (RF) classifier trained with the features and labels (a sketch follows).
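As a rough illustration of this training/testing stage, here is a hedged sketch using scikit-learn's RandomForestClassifier; the synthetic data, hyperparameters, and variable names are my own assumptions, since the talk only states that a Random Forest is trained on the 258 features with incident-derived labels.

```python
# Illustrative sketch of the classifier stage (not the authors' exact code).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve

# X: one 258-dimensional feature vector per organization; y: 1 = victim, 0 = non-victim.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((500, 258)), rng.integers(0, 2, 500)
X_test, y_test = rng.random((200, 258)), rng.integers(0, 2, 200)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Score test organizations and trace the true-positive / false-positive trade-off.
scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
print(f"{len(thresholds)} candidate operating points on the ROC curve")
```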
Primary features: raw data
Mismanagement symptoms (5)
- Five symptoms; each measures a fraction.
- Predictive power of these symptoms:
[Figure: CDFs of the fraction of untrusted HTTPS certificates and of open recursive resolvers, for victim vs. non-victim organizations.]
Malicious activity time series (60 × 3)
- Three time series over a recent period, one per activity type: spam, phishing, scanning.
- Recent-60 vs. recent-14 day windows.
[Figure: example per-organization malicious activity time series (number of listed IPs per day over 60 days) for the three activity types.]
Size: number of IPs in an aggregation unit (1)
- To some extent captures the likelihood of an organization becoming a target of, or reporting, intentional attacks (a toy illustration of these primary features follows).
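The sketch below shows, under my own simplifying assumptions, how daily blacklist snapshots might be turned into such a per-organization time series; `daily_blacklists` and `org_ips` are hypothetical inputs, not artifacts of the actual datasets.

```python
# Hypothetical illustration: per-day count of an organization's IPs that appear
# on any reputation blacklist, over the most recent `days` days.
def activity_time_series(daily_blacklists, org_ips, days=60):
    """daily_blacklists: list of sets of listed IPs, one set per day (oldest first).
    org_ips: set of IP addresses attributed to the organization."""
    recent = daily_blacklists[-days:]
    return [len(day & org_ips) for day in recent]

# Toy data: one such series would be built per activity type (spam, phishing,
# scan); the "size" feature is simply len(org_ips).
org_ips = {"192.0.2.1", "192.0.2.2", "192.0.2.3"}
daily = [{"192.0.2.1"}, {"192.0.2.1", "192.0.2.2"}, set()]
print(activity_time_series(daily, org_ips, days=3))  # -> [1, 2, 0]
```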
Secondary features
Quantization and feature extraction
[Figure: a malicious activity time series (number of IPs listed vs. days) quantized into regions of high and low activity.]
Persistency
- Measures security efforts and responsiveness.
- In each quantized region, measure the average magnitude, average duration, and frequency (a sketch of this extraction follows).
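A minimal sketch of such a quantization and extraction step, under the assumption of a simple mean-based threshold; the thresholding rule and function name are illustrative, not the paper's exact quantization.

```python
# Hypothetical secondary-feature extraction from a daily activity series:
# quantize the series into "good"/"bad" regions and summarize the bad regions
# by average magnitude, average duration, and frequency.
import numpy as np

def secondary_features(series):
    """series: daily counts of blacklisted IPs (e.g., a recent-60-day window)."""
    series = np.asarray(series, dtype=float)
    bad = series > series.mean()  # illustrative two-level quantization threshold
    # Collect contiguous runs of "bad" days as (start, end) index pairs.
    runs, start = [], None
    for day, flag in enumerate(bad):
        if flag and start is None:
            start = day
        elif not flag and start is not None:
            runs.append((start, day))
            start = None
    if start is not None:
        runs.append((start, len(bad)))
    durations = [end - s for s, end in runs]
    magnitudes = [series[s:end].mean() for s, end in runs]
    return {
        "bad_magnitude": float(np.mean(magnitudes)) if runs else 0.0,
        "bad_duration": float(np.mean(durations)) if runs else 0.0,
        "bad_frequency": len(runs) / len(series),  # how often bad regions occur
    }

print(secondary_features([3000, 3200, 8000, 9000, 3100, 7500, 7800, 3000]))
```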
A look at their predictive power (using data from Nov-Dec’13):
[Figure: CDFs of the un-normalized "bad" magnitude, normalized "good" magnitude, "bad" duration, and "bad" frequency, for victim vs. non-victim organizations.]
Training subjects
A subset of victim organizations, Group (1) or the incident group.
- Training-testing ratio, e.g., a 70-30 or 50-50 split.
- Split strictly according to time: use the past to predict the future.

            Hackmageddon        VCDB                WHID
Training    Oct '13 – Dec '13   Aug '13 – Dec '13   Jan '14 – Mar '14
Testing     Jan '14 – Feb '14   Jan '14 – Dec '14   Apr '14 – Nov '14

A random subset of non-victims, Group (0) or the non-incident group.
- Random sub-sampling is necessary to avoid class imbalance; the procedure is repeated over different random subsets (a sketch follows).
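Here is one way, purely as an illustration under my own assumptions about the data layout, that this repeated balanced sub-sampling could be implemented:

```python
# Hypothetical sketch: build several balanced training sets, each containing
# all victim organizations plus an equal-size random subset of non-victims.
import numpy as np

def balanced_training_sets(X, y, n_repeats=10, seed=0):
    """X: feature matrix (one row per organization); y: 1 = victim, 0 = non-victim."""
    rng = np.random.default_rng(seed)
    victims = np.flatnonzero(y == 1)
    non_victims = np.flatnonzero(y == 0)
    for _ in range(n_repeats):
        sampled = rng.choice(non_victims, size=len(victims), replace=False)
        idx = np.concatenate([victims, sampled])
        yield X[idx], y[idx]

# Toy usage: each (X_bal, y_bal) pair would train one classifier, with results
# averaged over the repeats.
X, y = np.random.default_rng(1).random((20, 4)), np.array([1] * 5 + [0] * 15)
for X_bal, y_bal in balanced_training_sets(X, y, n_repeats=2):
    print(X_bal.shape, y_bal.sum())
```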
Prediction procedure
Prediction performance
[Figure: ROC curves (true positive vs. false positive rate) for VCDB, Hackmageddon, WHID, and all datasets combined.]
Example of desirable operating points of the classifier:

Accuracy               Hackmageddon   VCDB   WHID   All
True Positive (TP)     96%            88%    80%    88%
False Positive (FP)    10%            10%    5%     4%
Overall Accuracy       90%            90%    95%    96%
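As an aside, one way such an operating point could be chosen from the classifier's ROC data is sketched below; this is my own illustration (the selection rule and the toy numbers are assumptions), not the authors' stated procedure.

```python
# Hypothetical operating-point selection: pick the threshold with the highest
# true-positive rate whose false-positive rate stays within a budget.
import numpy as np

def operating_point(fpr, tpr, thresholds, max_fpr=0.10):
    ok = np.flatnonzero(np.asarray(fpr) <= max_fpr)  # points meeting the FP budget
    best = ok[np.argmax(np.asarray(tpr)[ok])]        # highest TP rate among them
    return thresholds[best], fpr[best], tpr[best]

# Toy ROC points; in practice these would come from roc_curve on the RF scores.
fpr = np.array([0.00, 0.04, 0.10, 0.30])
tpr = np.array([0.00, 0.60, 0.88, 0.95])
thr = np.array([1.00, 0.80, 0.60, 0.30])
print(operating_point(fpr, tpr, thr, max_fpr=0.10))  # -> (0.6, 0.1, 0.88)
```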
Split ratio
[Figure: ROC curves for VCDB, 50-50 vs. 70-30 training-testing splits (short-term prediction).]
More training data yields better performance.
Long-term prediction
Short-term vs. long-term prediction
[Figure: ROC curves for VCDB with a 50-50 split, short-term vs. long-term prediction.]
Temporal features become outdated.
Importance of the features

Top feature descriptor           Value
Untrusted HTTPS certificates     0.1531
Frequency                        0.1089
Organization size                0.0976
Open recursive resolver          0.0928

- Two mismanagement features rank in the top 4.

Feature category                 Normalized importance
Mismanagement                    0.3229
Time series data                 0.2994
Recent-60 secondary features     0.2602

- Secondary features are almost as important as the time series data.
- Dynamic features > static features.
- Separate data sources do NOT achieve comparable results. (A sketch of how such importance values can be computed follows.)
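A hedged, self-contained sketch of how per-feature and per-category importance values like those above could be obtained from a trained Random Forest; the synthetic data and the category index ranges are assumptions for illustration only.

```python
# Illustrative computation of feature importances (not the authors' code).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = rng.random((500, 258)), rng.integers(0, 2, 500)  # synthetic stand-ins
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = clf.feature_importances_  # one value per feature, sums to 1
top4 = np.argsort(importances)[::-1][:4]
print("top feature indices:", top4, importances[top4].round(4))

# Normalized category importance = sum of its features' importances.
# The index ranges below are hypothetical groupings of the 258 features.
categories = {"mismanagement": range(0, 5),
              "time_series": range(5, 185),
              "recent60_secondary": range(185, 258)}
for name, idx in categories.items():
    print(name, round(float(importances[list(idx)].sum()), 4))
```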
Case study: Data Breaches of 2014
[Figure: CDFs of the predictor's output for a randomly selected non-victim set vs. the VCDB victim set, with a threshold of 0.85 and annotated examples: OnlineTech 0.92, Sony Pictures 0.90, Ebay 0.88, AXTEL 0.87, Home Depot 0.85, ACME 0.85, Target 0.84, BJP Junagadh 0.77.]
- High-profile data breaches from 2014: Sony Pictures (0.90), Ebay (0.88), Home Depot (0.85), Target (0.84), OnlineTech/JP Morgan Chase (0.92).
Discussions
Errors in the data.
Robustness against adversarial data.
Prediction by incident type.
- O. Thonnard, L. Bilge, A. Kashyap, and M. Lee, "Are You at Risk? Profiling Organizations and Individuals Subject to Targeted Attacks," Financial Cryptography and Data Security, 2015.
- A. Sarabi, P. Naghizadeh, Y. Liu, and M. Liu, "Prioritizing Security Spending: A Quantitative Analysis of Risk Distributions for Different Business Profiles," WEIS 2015.
Quality of reported data.
- Part of our data can be downloaded here: http://grs.eecs.umich.edu.
Q & A
Acknowledgement
- We thank NSF and DHS for funding.
Project webpage (part of the data is available)
- http://grs.eecs.umich.edu
- http://www.umich.edu/~youngliu