Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ......
Transcript of Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ......
1Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Software Engineering InstituteCarnegie Mellon UniversityPittsburgh, PA 15213
© 2016 Carnegie Mellon UniversityApproved for Public Release; Distribution is Unlimited
Applied Machine Learning in Software SecurityEliezer Kanal
2Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Copyright 2017 Carnegie Mellon University
This material is based upon work funded and supported by the Department of Defense under Contract No. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center.
NO WARRANTY. THIS CARNEGIE MELLON UNIVERSITY AND SOFTWARE ENGINEERING INSTITUTE MATERIAL IS FURNISHED ON AN “AS-IS” BASIS. CARNEGIE MELLON UNIVERSITY MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, AS TO ANY MATTER INCLUDING, BUT NOT LIMITED TO, WARRANTY OF FITNESS FOR PURPOSE OR MERCHANTABILITY, EXCLUSIVITY, OR RESULTS OBTAINED FROM USE OF THE MATERIAL. CARNEGIE MELLON UNIVERSITY DOES NOT MAKE ANY WARRANTY OF ANY KIND WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT.
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
This material may be reproduced in its entirety, without modification, and freely distributed in written or electronic form without requesting formal permission. Permission is required for any other use. Requests for permission should be directed to the Software Engineering Institute at [email protected].
Carnegie Mellon® and CERT® are registered marks of Carnegie Mellon University.
DM-0004563
3Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Tom Mitchell, former CMU Machine Learning department chair:
The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?”
Machine Learning seeks to automate data analysis and inference.
What is Machine Learning?
4Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
What is Machine Learning?
If your problem can be stated as either of the following:
…you would likely benefit from machine learning.
Iwouldliketouse_____datatopredict_____.Iwouldliketouse____data
toguesswhat____is.
5Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
What is Machine Learning?
Sample Techniques:
• Regression
• K-Means Clustering
Clustering image: Weston.pace, https://commons.wikimedia.org/wiki/File:K_Means_Example_Step_4.svg
6Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
What is Machine Learning?
Feature Engineering:Using existing data to create more informative data
DataTypes
Image StaticVideo
Timeseries FinancialdataEvent counts
Structuredtext WebformsStructureddata
(JSON,XML)Sourcecode
Freetext NewsTweetsEmail
manymore…
7Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
What is Machine Learning?
Examples:
o I would like to use incident ticket data to predictcustomer needs .
o I would like to use publicly available code to predict what code I will write .
o I would like to use bug report data to guess the location of undetected bugs in my code .
8Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
9Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
What is Machine Learning?
Examples:
o I would like to use incident ticket data to predictcustomer needs .
o I would like to use publicly available code to predict what code I will write .
o I would like to use bug report data to guess the location of undetected bugs in my code .
10Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Applied ML:Vulnerability Detection
11Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Applied ML: Vulnerability Detection
Analyzer
Analyzer
Analyzer
Codebases
Alerts Today3,147
11,772
48,690
0
10,000
20,000
30,000
40,000
50,000
60,000
TP FP Susp
Manyalertsleftunaudited!
12Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Applied ML: Vulnerability Detection
Analyzer
Analyzer
Analyzer
Codebases
Alerts Today
OurGoal
3,147
11,772
48,690
0
10,000
20,000
30,000
40,000
50,000
60,000
TP FP Susp
66effortdays
12,076
45,172
6,361
0
5,000
10,000
15,000
20,000
25,000
30,000
35,000
40,000
45,000
50,000
e-TP e-FP I
AutomatedStatisticalClassifier
• ExpectedTruePositive(e-TP)• ExpectedFalsePositive(e-FP)• Indeterminate(I)
13Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Applied ML: Vulnerability Detection
ClassifiersLassoLogisticRegressionCARTRandomForestExtremeGradientBoosting(XGBoost)
Some ofthefeatures usedAnalysistoolsused Tokensinfunc/methodSignificantLOC Alertsinfunc/methodComplexity AlertsinfileCoupling MethodsinfileCohesion SLOCinfileSEIcodingrule Avg TokensFunction/methodlength Avg SLOCSLOCinfunc/method Depthincoderepository#parametersinfunc/meth.
Cyclomatic complexity(func/meth)
14Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Applied ML: Vulnerability Detection
Significant improvement!
• 91% Classifier accuracy overall
• Specific rule accuracy at right
• 10x developer time saved!
RuleID LassoLRRandomForest CART XGBoost
INT31-C 98% 97% 98% 97%EXP01-J 74% 74% 81% 74%OBJ03-J 73% 86% 86% 83%FIO04-J* 80% 80% 90% 80%EXP33-C* 83% 87% 83% 83%EXP34-C* 67% 72% 79% 72%DCL36-C* 100% 100% 100% 100%ERR08-J* 99% 100% 100% 100%IDS00-J* 96% 96% 96% 96%ERR01-J* 100% 100% 100% 100%ERR09-J* 100% 88% 88% 88%
* Small quantity of data
RuleID LassoLRRandomForest CART XGBoost
INT31-C 98% 97% 98% 97%EXP01-J 74% 74% 81% 74%OBJ03-J 73% 86% 86% 83%FIO04-J* 80% 80% 90% 80%EXP33-C* 83% 87% 83% 83%EXP34-C* 67% 72% 79% 72%DCL36-C* 100% 100% 100% 100%ERR08-J* 99% 100% 100% 100%IDS00-J* 96% 96% 96% 96%ERR01-J* 100% 100% 100% 100%ERR09-J* 100% 88% 88% 88%
15Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Applied ML:Malware family classification
16Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Applied ML: Malware family classification
ReverseEngineering
Discovery
Refinement
Reflection
File
NewFamily
ArtifactCatalog
Signature 1Signature 2Signature 3
…
Files 1a 1b 1c 1d …Files 2a 2b 2c 2d …Files 3a 3b 3c 3d …
…
Whichfileshang
together?
17Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Applied ML: Malware family classification
SignalFlowgraphhighlightsbehaviorrelatingdifferent
malwarefamilies
Programinstructionanalysisshowssimilarityanddiversionofbehavior
StaticAnalysisidentifiesprogramswithsimilarsourcecode
18Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Applied ML: Malware family classification
Simplify visualization of extremely complex data through the use of dimensionality reduction and associated visualization techniques
Chernoff faceexperiment
t-SNE
19Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Applied ML: Malware family classification
• Ground Truth: SVM trained with expert ground truth labels.
• Turkers Avg: Classifier trained with layperson labels.
Performance surprisingly similar!
20Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Applied ML:Software cost estimation
RobertFerguson,DennisGoldenson,JamesMcCurley,RobertW.Stoddard,DavidZubrow,DebraAnderson.“QuantifyingUncertaintyinEarlyLifecycleCostEstimation(QUELCE)”.Dec2011.http://resources.sei.cmu.edu/library/asset-view.cfm?assetid=10039
21Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Applied ML: Software cost estimation
GeneralAccountingOffice.DefenseAcquisitions:AKnowledge-BasedFundingApproachCouldImproveMajorWeaponSystemProgramOutcomes.ReporttotheCommitteeonArmedServices,U.S.Senate,July2008,GAO-08-619.
22Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Applied ML: Software cost estimation
23Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Applied ML: Software cost estimation
24Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Applied ML: Software cost estimation
25Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Applied ML:Incident report mapping
26Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Applied ML: Incident report mapping
International partners
Agency infrastructure
Threat actors
Phishing campaign
Agency response
teams US-CERT infrastructure
Malware class
Compromised website
Common reference websites
Observation
Landscape Tickets
Inference
27Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Applied ML: Incident report mapping
28Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Indicators across ticketsIndicatorsoccurwithdiversepatternsacrosstickets,reportersandtime.Timeonxaxis,countonyaxis,colorcodedbyreporter.
MaliciousIP
AgencyIP
US-CERTdomain
29Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Similarity of indicatorsBeginningwithareferenceindicator,wefindindicatorssimilartoit.Example:amaliciousIP
• Coloredcirclesaretickets• Greycirclesareindicators• Largeindicatorsnear
centerofcirclehavesimilaroccurrencepatternstothereferenceindicator.
30Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Indicator communitiesButwhatifwearen’tstartingwithareferenceindicator?Weassumethatindicatorsgeneratedbyacoherentrealworldprocesswillbemorelikelytoco-occurinticketsthanarbitrarypairsofindicators.Findgroupsofhighlysimilarindicatorsincompleteindicator-ticketgraph.
31Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Indicator-ticket graph
Asubsetoftheticket-indicatorgraph(forasmallsetofselectedindicators)• Ticketsaregreytriangles• Indicatorsareblackcircles• Edgesconnectticketstotheindicatorsthey
contain
32Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University
Approved for Public Release; Distribution is Unlimited
Contact Information
Eliezer KanalTechnical ManagerTelephone: +1 412.268.5204Email: [email protected]