AI approach to malware similarity analysis: Mapping the malware genome with a deep neural network

An AI Approach to Malware Similarity Analysis: Mapping the Malware Genome With a Deep Neural Network. Dr. Konstantin Berlin, Senior Research Engineer, Invincea Labs

Transcript of AI approach to malware similarity analysis: Mapping the malware genome with a deep neural network

Page 1: AI approach to malware similarity analysis: Mapping the malware genome with a deep neural network

An AI Approach to Malware Similarity Analysis: Mapping the Malware Genome With a Deep Neural Network

Dr. Konstantin Berlin, Senior Research Engineer

Invincea Labs

Page 2: AI approach to malware similarity analysis: Mapping the malware genome with a deep neural network

Enterprise Networks are Under Constant Attack

• Intelligence is critical for prevention

• Cyber defenders are overwhelmed

• AI can help find important relations

Figure: Number of Network Breaches Per Year (Verizon’s 2016 Data Breach Investigations Report)

Page 3: AI approach to malware similarity analysis: Mapping the malware genome with a deep neural network

Intelligence through Malware Triage

• Benefits

  • Identify threat actors

    • Link various attacks to a single actor

  • Quickly understand functionality

    • Speed up reverse engineering

  • Mitigation

    • Signatures

    • Network Rules

Figure label: Enterprise Network

Page 4: AI approach to malware similarity analysis: Mapping the malware genome with a deep neural network

Nearest-Neighbor Classification

Figure axes: Feature A, Feature B

Similarity Search

• MinHash

• Feature hashing

• Other sketching

• …

Jang, Jiyong, et al. Proceedings of the 18th ACM Conference on Computer and Communications Security. ACM, 2011.

Sæbjørnsen, Andreas, et al. Proceedings of the 18th International Symposium on Software Testing and Analysis. ACM, 2009.

Bayer, Ulrich, et al. NDSS. Vol. 9. 2009.

… Many more

Attribute Embedding

A B C D E

Attribute Extraction

Attributes

• Byte n-grams

• Opcode n-grams

• Printable strings

• System calls

• …
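As a rough illustration of the similarity-search step listed on this slide, the sketch below estimates Jaccard similarity between samples with MinHash over byte n-grams. The sample byte strings, the 4-byte n-gram size, the 64-hash signature length, and the use of salted BLAKE2b as the hash family are all assumptions made for the example, not details from the talk.

```python
import hashlib

def ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of byte n-grams in `data` (assumes len(data) >= n)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def minhash(items: set, num_hashes: int = 64) -> list:
    """For each salted hash function, keep the minimum hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "little")
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical "samples": stand-ins for attribute data extracted from binaries.
samples = {
    "sample_a": b"connect to c2.example.test and download payload v1",
    "sample_b": b"connect to c2.example.test and download payload v2",
    "sample_c": b"a completely unrelated benign string table here",
}
signatures = {name: minhash(ngrams(blob)) for name, blob in samples.items()}

# Nearest neighbor of sample_a by estimated Jaccard similarity.
query = "sample_a"
nearest = max(
    (name for name in signatures if name != query),
    key=lambda name: estimated_jaccard(signatures[query], signatures[name]),
)
print(nearest)
```

Comparing fixed-length signatures instead of full n-gram sets keeps each comparison cheap, which is the usual motivation for sketching methods of this kind.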

Page 5: AI approach to malware similarity analysis: Mapping the malware genome with a deep neural network

What Can Go Wrong with Embedding?

• Embeddings skew distances

• Same embedded data can give different neighbors (e.g., Alaska and Russia)

Page 6: AI approach to malware similarity analysis: Mapping the malware genome with a deep neural network

Issues with Attribute Embedding

Figure: two panels, "Possible Attributes #1" and "Possible Attributes #2", each with axes Feature A and Feature B and a sample marked "?"

How to get consistent results, regardless of features?
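The question above can be made concrete with a toy example: the same query sample gets a different nearest neighbor once one feature is rescaled. The points, names, and scale factors below are invented for illustration only.

```python
import math

def nearest_neighbor(query, candidates, scale=(1.0, 1.0)):
    """Return the name of the candidate closest to `query` after per-feature scaling."""
    def scaled(point):
        return tuple(value * factor for value, factor in zip(point, scale))
    q = scaled(query)
    return min(candidates, key=lambda name: math.dist(q, scaled(candidates[name])))

# Hypothetical (Feature A, Feature B) coordinates.
query = (2.0, 2.0)
candidates = {
    "x": (3.0, 2.0),  # differs from the query only in Feature A
    "y": (2.0, 4.0),  # differs from the query only in Feature B
}

print(nearest_neighbor(query, candidates))                     # unscaled features
print(nearest_neighbor(query, candidates, scale=(10.0, 1.0)))  # Feature A stretched 10x
```

With unscaled features the query is 1 unit from x and 2 units from y; stretching Feature A by 10 puts x at distance 10 while y stays at 2, so the neighbor flips.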

Page 7: AI approach to malware similarity analysis: Mapping the malware genome with a deep neural network

Supervised Classification (Endpoint Solution)

Figure labels: inputs A B C D E; Layer 1, Layer 2, Layer 3, …, Layer N, [0, 1]; Layer X; family labels backdoor.msil.bladabindi.aa, backdoor.msil.bladabindi.aj, worm.winnt.lurka.a, …

Malware Detection

Categorical Classification

• 0.97 F1-score (precision and recall)

• 1500 Microsoft Families

• 2.0M Training files

Joshua Saxe and Konstantin Berlin, (MALWARE). IEEE, 2015.

Not enough for a triage system!
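To make the two output types on this slide concrete, here is a small sketch of a detection score in [0, 1] and a categorical prediction over family labels. It assumes a sigmoid output for detection and a softmax output for family classification, which the slide does not state; the logit values are invented, and only the family names come from the slide.

```python
import math

def sigmoid(z: float) -> float:
    """Squash a single logit into a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits: list) -> list:
    """Turn a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

families = [
    "backdoor.msil.bladabindi.aa",
    "backdoor.msil.bladabindi.aj",
    "worm.winnt.lurka.a",
]

# Hypothetical final-layer outputs for one file.
detection_logit = 2.0
family_logits = [0.5, 3.0, -1.0]

malware_probability = sigmoid(detection_logit)
family_probs = softmax(family_logits)
predicted_family = families[family_probs.index(max(family_probs))]

print(round(malware_probability, 3), predicted_family)
```

Both outputs are per-sample scores or labels; neither one is a distance between two samples, which is what a nearest-neighbor triage workflow consumes.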

Page 8: AI approach to malware similarity analysis: Mapping the malware genome with a deep neural network

How do we map malware into an embedding so that distances make semantic sense?

Page 9: AI approach to malware similarity analysis: Mapping the malware genome with a deep neural network

Imaginary World of Malware Factories

• Ideal World

  • Each hidden factory produces one malware family/variant

  • Factories are positioned relative to what and how they exploit vulnerabilities

• …but this is not what we have!?

Figure labels: Secret sauce A, Secret sauce B, Idealized Embedding

Page 10: AI approach to malware similarity analysis: Mapping the malware genome with a deep neural network

There is No Spoon Embedding…

• We created the embedding when we selected the features

• We can morph them in any way we choose

• One good way to morph the features is using a deep neural network

Figure labels: A B C D E, Deep Neural Network, a b c d

“The Matrix”, 1999
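A minimal sketch of what "morphing" a feature vector with a neural network can look like: five input attributes (A to E) pass through dense layers and come out as a four-dimensional embedding (a to d). The layer sizes, ReLU activation, and random untrained weights are assumptions for illustration; the talk's actual architecture is not given in this transcript.

```python
import random

def dense(x, weights, bias):
    """Fully connected layer: one weighted sum per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in x]

def random_layer(rng, n_in, n_out):
    """Hypothetical untrained layer with uniform random weights and zero bias."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)  # fixed seed so the example is repeatable
layers = [random_layer(rng, 5, 8), random_layer(rng, 8, 4)]

def embed(features):
    """Map an attribute vector to an embedding; ReLU on all but the last layer."""
    hidden = features
    for index, (weights, bias) in enumerate(layers):
        hidden = dense(hidden, weights, bias)
        if index < len(layers) - 1:
            hidden = relu(hidden)
    return hidden

attributes = [1.0, 0.0, 2.0, 0.5, 0.0]  # stand-in for attributes A B C D E
embedding = embed(attributes)           # stand-in for embedding a b c d
print(embedding)
```

In practice the weights would be learned from labeled data so that distances in the output space reflect the labels; here they are random, so only the shape of the mapping is meaningful.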

Page 11: AI approach to malware similarity analysis: Mapping the malware genome with a deep neural network

Morphing the Embedding

Figure labels: A B C D E, Encoder, Embedding (a b c d), Noise, Secret sauce, Decoder, Deep Neural Network (Morphing), Family Labels, Predicted Families (worm.win32.vobfus.hc, …, trojanspy.win32.nivdort.af)

Variational Autoencoder: Kingma, D. P., & Welling, M. (2013). arXiv preprint arXiv:1312.6114.

Panel labels: "Only one factory", "Embedding clusters families"
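The "Noise" label and the variational autoencoder citation suggest a sampled embedding. The sketch below shows the standard reparameterized sampling step and the KL term from Kingma and Welling's formulation; whether the talk's model uses exactly these is an assumption, and the function names and example values are made up.

```python
import math
import random

def sample_embedding(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, sigma^2) to N(0, 1), summed over dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

rng = random.Random(0)

# Hypothetical encoder outputs for one sample in a 2D embedding.
mu = [1.0, -0.5]
log_var = [0.0, -2.0]

z = sample_embedding(mu, log_var, rng)
penalty = kl_to_standard_normal(mu, log_var)
print(z, penalty)
```

In a plain variational autoencoder the KL term pulls every sample toward the same standard normal prior, which may be what the "Only one factory" label refers to.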

Page 12: AI approach to malware similarity analysis: Mapping the malware genome with a deep neural network

Embedding Visualization

• Toy Example

  • 8 family/variant prediction

  • 2D embedding

Figure panels: "Only one factory" and "Embedding clusters families", each with axes a and b; input attributes A B C D E

Figure annotations: "Collapse into a singularity", "Inter-class distances not preserved"

Page 13: AI approach to malware similarity analysis: Mapping the malware genome with a deep neural network

Printable Strings Results

• 800K samples

• 1500 family/variants (99% coverage)

• Time-split Validation

  • Train on old data

  • Test on data from 30 days later

• Measure F1-score of 3-nearest neighbor classifier

Figure labels: Feature Vector, Semantic Embedding, Deep-learning Features

F1-scores shown: 0.44, 0.94, 0.96, 0.66

Figure annotations: no gap, gap, small gap, large gap
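The evaluation protocol on this slide (train on older data, test on later data, score a 3-nearest-neighbor classifier with F1) can be sketched as follows. The sample coordinates, day numbers, cutoff, and the choice of macro-averaged F1 are assumptions; the transcript does not say which F1 averaging was used.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority label among the k training points closest to `query`."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical (day first seen, 2D embedding, family) records.
samples = [
    (1, (0.0, 0.1), "fam_a"), (2, (0.1, 0.0), "fam_a"), (3, (0.2, 0.1), "fam_a"),
    (4, (1.0, 1.1), "fam_b"), (5, (1.1, 1.0), "fam_b"), (6, (0.9, 1.0), "fam_b"),
    (40, (0.1, 0.1), "fam_a"), (41, (1.0, 0.9), "fam_b"), (42, (0.6, 0.7), "fam_a"),
]

cutoff_day = 10  # time split: train on earlier samples, test on later ones
train = [(vec, label) for day, vec, label in samples if day <= cutoff_day]
test = [(vec, label) for day, vec, label in samples if day > cutoff_day]

y_true = [label for _, label in test]
y_pred = [knn_predict(train, vec) for vec, _ in test]
score = macro_f1(y_true, y_pred)
print(y_pred, round(score, 3))
```

Splitting by time rather than at random keeps later samples out of the training set, so the score reflects performance on data the model has not yet seen.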

Page 14: AI approach to malware similarity analysis: Mapping the malware genome with a deep neural network

Conclusion

• Developing feature extraction is expensive and requires time-consuming tuning to adapt to a specific domain

• Traditional approaches to malware similarity are unsupervised and so are brittle

• Using supervised-learning approaches we can improve existing features by embedding them into a more optimized space

• Automatic (re)tuning will improve detection rates and reduce cost

Page 15: AI approach to malware similarity analysis: Mapping the malware genome with a deep neural network

More Information

• Email: [email protected]

• Twitter: @kberlin