CISC 879 - Machine Learning for Solving Systems Problems
Presented by: Satyajeet
Dept of Computer & Information Sciences, University of Delaware

Automatic Analysis of Malware Behavior using Machine Learning
Authors: Konrad Rieck, Philipp Trinius, Carsten Willems, and Thorsten Holz

Abstract & Introduction

• Malware poses a major threat to the security of computer systems.

• Very diverse – includes viruses, Internet worms, and Trojan horses.

• Large in volume – millions of hosts are infected.

• Obfuscation and polymorphism impede detection at the file level.

• Dynamic analysis helps in characterizing malware and defending against it.

Abstract & Introduction Contd..

• A framework for automatic analysis of malware behavior using machine learning.

• The framework automatically identifies novel classes of malware with similar behavior – Clustering.

• It assigns unknown malware to these discovered classes – Classification.

• An incremental approach to behavior-based analysis builds on both.

Automatic analysis of Malware Behavior

• Framework steps and procedure

• Malware binaries are executed and monitored in a sandbox environment. A report is generated on their system calls and arguments.

• The sequential reports are embedded in a vector space where each dimension is associated with a behavioral pattern.

• ML techniques are then applied to the embedded reports to identify and classify malware.

• Incremental analysis progresses by alternating between clustering and classification.

Report representation

• Reports can be textual or XML.

• These formats are human readable and suitable for computing general statistics.

• But they are not efficient for automatic analysis.

• Hence MIST (Malware Instruction Set).

• Inspired by instruction sets used in processor design.

MIST

• Category – encodes the category of system calls.

• Operation – reflects a particular system call.

• Arguments – represented as argument blocks (argblocks).
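The three fields above can be illustrated with a small parser. This is a minimal sketch, not the authors' tooling: it assumes a textual layout of the form `CATEGORY OPERATION | ARGBLOCK1 | ARGBLOCK2`, and the names `MistInstruction` and `parse_mist` as well as the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MistInstruction:
    category: str         # category of system calls
    operation: str        # particular system call within the category
    argblocks: List[str]  # arguments, grouped into blocks


def parse_mist(line: str) -> MistInstruction:
    """Split one instruction line into category, operation and argblocks.

    Assumed layout (illustrative): 'CATEGORY OPERATION | ARGBLOCK1 | ...'
    """
    head, *blocks = [part.strip() for part in line.split("|")]
    category, operation = head.split()
    return MistInstruction(category, operation, blocks)


# Made-up sample line, only to show the shape of the result.
instr = parse_mist("03 05 | 01 000000 | 00006ce5")
print(instr.category, instr.operation, instr.argblocks)
```

Keeping the argblocks as a list makes it easy to drop trailing blocks when a coarser view of the behavior is wanted.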

Sandbox and MIST representation

Representation

• These sequential reports capture typical behavior of malware, such as changing registry keys or modifying system files.

• But they are still not suitable for efficient analysis techniques. Hence the need to embed behavior reports in a vector space using instruction q-grams.

• This embedding expresses the similarity of behavior geometrically, as the distance between embedded reports.
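The q-gram embedding and the distance computation can be sketched as follows. This is an illustration under assumptions not stated on the slide: each dimension records only whether a q-gram occurs in the report, vectors are scaled to unit length, and the Euclidean distance is used. The function names are hypothetical.

```python
import math


def qgrams(instructions, q=2):
    """Return the set of q-grams (tuples of q consecutive instructions)."""
    return {tuple(instructions[i:i + q])
            for i in range(len(instructions) - q + 1)}


def embed(instructions, q=2):
    """Sparse embedding: one dimension per q-gram seen in the report.

    Assumption: binary presence values, normalized to unit Euclidean norm.
    """
    grams = qgrams(instructions, q)
    if not grams:
        return {}
    weight = 1.0 / math.sqrt(len(grams))
    return {g: weight for g in grams}


def distance(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))


report_a = ["open_file", "write_file", "set_registry"]
report_b = ["open_file", "write_file", "set_registry"]
report_c = ["connect", "send", "recv"]

print(distance(embed(report_a), embed(report_b)))  # identical behavior
print(distance(embed(report_a), embed(report_c)))  # no shared q-grams
```

With unit-length vectors, reports with identical q-grams are at distance 0 and reports with no q-grams in common are at distance sqrt(2), so the distance stays bounded regardless of report length.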

Clustering and Classification

• Once reports are embedded in a vector space, they are ready for ML techniques.

• Clustering of behavior – identifies classes of malware with similar behavior.

• Classification of behavior – assigns malware to known classes of behavior.

• What allows us to do this?

• Malware binaries usually belong to families of similar variants with similar behavior patterns!

Algorithms

• Prototype extraction

• Iterative algorithm.

• Extracts a small set of prototypes from the set of reports. The first one is chosen at random.

• Clustering using Prototypes

• At the beginning, each prototype is an individual cluster.

• The algorithm repeatedly determines and merges the nearest pair of clusters.

• Classification using Prototypes

• Allows learning to discriminate between classes of malware.
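Prototype extraction and prototype-based clustering can be sketched as below. The slide only states that extraction is iterative with a random first prototype and that clustering merges nearest clusters; the farthest-first selection rule, the `max_dist` and `merge_dist` thresholds, and the complete-linkage cluster distance are assumptions made for this sketch, and all names are hypothetical.

```python
import math
import random


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def extract_prototypes(points, max_dist, rng=None):
    """Pick prototype indices until every point is within max_dist of one.

    Assumption: after a random first pick, the point farthest from all
    current prototypes becomes the next prototype.
    """
    rng = rng or random.Random(0)
    protos = [rng.randrange(len(points))]
    nearest = [euclid(p, points[protos[0]]) for p in points]
    while max(nearest) > max_dist and len(protos) < len(points):
        far = max(range(len(points)), key=nearest.__getitem__)
        protos.append(far)
        nearest = [min(d, euclid(p, points[far]))
                   for d, p in zip(nearest, points)]
    return protos


def cluster_prototypes(points, protos, merge_dist):
    """Agglomerative clustering over prototypes only.

    Assumption: complete linkage, stopping once the nearest pair of
    clusters is farther apart than merge_dist.
    """
    clusters = [[i] for i in protos]

    def linkage(a, b):
        return max(euclid(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > merge_dist:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
protos = extract_prototypes(pts, max_dist=0.5)
print(protos, cluster_prototypes(pts, protos, merge_dist=1.0))
```

Clustering only the prototypes rather than all reports is what keeps the pairwise distance computations affordable on a large corpus.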

Algorithms Contd..

• For each report, the algorithm determines the nearest prototype of the clusters in the training data; if the report lies within the radius, it is assigned to that cluster.

• Otherwise the report is rejected and held back for later incremental analysis.

• Incremental analysis

• Reports to be analyzed are received from a source.

• They are initially classified using prototypes of known clusters.

• Thereby variants of known malware are identified and filtered from further analysis.

• Prototypes are extracted from the remaining reports and clustered again.
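The classification step with rejection, followed by the split used in incremental analysis, can be sketched as follows. This is a minimal illustration, assuming reports are already embedded as numeric vectors and that prototypes carry a cluster label; `classify`, `split_known_unknown` and the `radius` parameter name are hypothetical.

```python
import math


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(report, prototypes, radius):
    """Return the label of the nearest prototype, or None if rejected.

    prototypes: list of (vector, cluster_label) pairs from known clusters.
    """
    best_label, best_dist = None, float("inf")
    for vector, label in prototypes:
        d = euclid(report, vector)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= radius else None


def split_known_unknown(reports, prototypes, radius):
    """One incremental pass: assign known variants, hold back the rest."""
    assigned, rejected = {}, []
    for idx, report in enumerate(reports):
        label = classify(report, prototypes, radius)
        if label is None:
            rejected.append(idx)   # goes on to prototype extraction/clustering
        else:
            assigned[idx] = label  # variant of an already known cluster
    return assigned, rejected


known = [((0.0, 0.0), "family_a"), ((5.0, 5.0), "family_b")]
incoming = [(0.2, 0.0), (2.5, 2.5), (5.1, 4.9)]
print(split_known_unknown(incoming, known, radius=1.0))
```

Only the rejected reports need new prototypes and a new clustering run, which is why the incremental scheme scales better than re-clustering everything.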

Experiments and Results

Evaluating components

• Prototype extraction

• Evaluated using precision, recall and compression.

• Precision – 0.99 at a corpus compression of 2.9% and 7%.

• Clustering

• Evaluated using F-measure.

• F-measure in experiments – 0.93 for MIST 1 and 0.95 for MIST 2, better than the 0.881 of previous related work.

• Classification

• F-measure in experiments – 0.96 for MIST 1 and 0.99 for MIST 2.
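The metrics named above can be computed as in the sketch below. The F-measure is the harmonic mean of precision and recall; the cluster-level definitions of precision and recall used here (majority class per cluster, majority cluster per class) are an assumption for illustration, since the slide does not define them, and the function names are hypothetical.

```python
from collections import Counter


def precision(clusters, labels):
    """Fraction of reports that share the majority class of their cluster."""
    n = len(labels)
    total = 0
    for c in set(clusters):
        members = [labels[i] for i in range(n) if clusters[i] == c]
        total += Counter(members).most_common(1)[0][1]
    return total / n


def recall(clusters, labels):
    """Fraction of reports that fall in the majority cluster of their class."""
    return precision(labels, clusters)  # same computation with roles swapped


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


clusters = [0, 0, 0, 0]          # everything merged into one cluster
labels = ["a", "a", "b", "b"]    # two true classes
p, r = precision(clusters, labels), recall(clusters, labels)
print(p, r, f_measure(p, r))
```

In this toy case a single merged cluster gets perfect recall but only 0.5 precision, which is why the two are combined into one F-measure.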

Conclusion

• A new framework is introduced that overcomes several deficiencies of previous work.

• The framework is learning-based.

• The framework can be implemented in practice.

• Steps – collect malware, analyze it in a sandbox environment, embed the observed behavior in a vector space, and apply learning algorithms (clustering and classification).

• This process is efficient and learns automatically after the initial setup and run.

Thank you !