MutantX-S: Scalable Malware Clustering Based on Static Features Xin Hu, IBM T.J. Watson Research...


MutantX-S: Scalable Malware Clustering Based on Static Features

Xin Hu, IBM T.J. Watson Research Center; Sandeep Bhatkar and Kent Griffin, Symantec Research Labs; Kang G. Shin, University of Michigan

2015. 04. 21

박종화 [email protected]

Computer Security & OS Lab.

Index

Motivation
Architecture
Generic Unpacking Algorithm
Feature Extraction
Prototype-Based Clustering
Evaluation

Motivation

Why cluster malware? Automatic labeling of the large number of malware samples is currently lacking.

Motivation

How can this huge influx of new samples be processed efficiently and labeled accurately?

[Figure: malware samples grouped into Family 1, Family 2, Family 3, and Family 4]

One possible solution is to cluster malware samples automatically:
Prioritize limited resources
Avoid analyzing samples that have already been analyzed
Label new incoming samples by association
Generalize previous detection and mitigation strategies to new variants

Architecture

MutantX-S is a framework developed to cluster malware automatically. It does so by analyzing a program's static features (assembly code).

Process
1. Preprocess
2. Feature Extraction
3. Clustering

Generic Unpacking Algorithm

Exploits an inherent property of the unpacking process: a packed binary has to write the unpacked code into some memory space and transfer control to the modified memory locations to continue execution.

Tracks memory accesses via the no-execute (NX) support in modern x86 CPUs and OSes.

[Figure: packed malware]

[Figure, step 1: the packed malware (packed data and unpacker code) is loaded into process memory; every memory page is marked W = 0, X = 1, i.e. executable but non-writable.]

[Figure, step 2: all memory pages are still W = 0, X = 1 (executable but non-writable), so a memory write raises a write (W) exception.]

[Figure, step 3: on a memory write, the written page is marked dirty and switched to W = 1, X = 0; the other pages remain W = 0, X = 1.]

[Figure, step 4: when unpacking finishes, the pages are W = 1, X = 0, so transferring control to them raises an execute (X) exception. The process memory image is then dumped, and the unpacked malware is passed to the disassembler.]
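The page-permission bookkeeping described on these slides can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the names `Page` and `UnpackTracker` are hypothetical, the exceptions are modeled as plain method calls, and no real CPU or OS mechanism is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """One simulated memory page with write (W) and execute (X) permission bits."""
    writable: bool = False
    executable: bool = True
    dirty: bool = False


@dataclass
class UnpackTracker:
    """Toy model of NX-based unpacking detection (hypothetical, not the authors' code)."""
    pages: dict = field(default_factory=dict)
    dumped: bool = False

    def load(self, page_ids):
        # Every page starts executable but non-writable (W = 0, X = 1).
        for pid in page_ids:
            self.pages[pid] = Page()

    def write(self, pid):
        page = self.pages[pid]
        if not page.writable:
            # Simulated W exception: mark the page dirty, flip it to W = 1, X = 0.
            page.writable, page.executable, page.dirty = True, False, True

    def execute(self, pid):
        page = self.pages[pid]
        if not page.executable:
            # Simulated X exception: control reached a page that was written earlier.
            if page.dirty:
                self.dumped = True  # stands in for dumping the process memory image
            page.writable, page.executable = False, True
        return self.dumped


tracker = UnpackTracker()
tracker.load(["packed_data", "unpacker_code"])
tracker.execute("unpacker_code")       # original code runs, nothing is dumped
tracker.write("packed_data")           # the unpacker writes the unpacked code
print(tracker.execute("packed_data"))  # control transfer into the written page
```

In this model a dump is triggered only when execution reaches a page that was written after loading, which is the property the slides rely on.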

Feature Extraction

MutantX-S uses IDA Pro to disassemble a malware program into a sequence of machine instructions, which are then used for feature extraction.

Malware samples are compared for similarity based on their disassembled instruction sequences.

MutantX-S uses opcodes as features:
Opcodes generalize well to represent variants of a malware family.
Opcode sequences offer a better representation of instruction semantics.
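To make the idea of opcode-sequence features concrete, here is a minimal sketch that counts runs of consecutive opcodes. The function name `opcode_ngrams` and the sample opcode list are hypothetical; a real pipeline would take the opcodes from the disassembler output.

```python
from collections import Counter


def opcode_ngrams(opcodes, n=2):
    """Count every run of n consecutive opcodes in an opcode sequence."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


# Hypothetical opcode sequence, as might be read from a disassembly listing.
sample = ["push", "mov", "sub", "push", "mov", "call"]
counts = opcode_ngrams(sample, n=2)
print(counts[("push", "mov")])  # 2
```

Operands are dropped entirely here, so two variants that differ only in registers or addresses produce the same counts.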

Feature Extraction

N-gram analysis embeds the features into feature vectors:
The number of dimensions D determines the complexity.
D increases exponentially with N in an N-gram: D = |O|^N, where |O| is the number of different opcodes.

Hashing kernel:
Reduces the dimensionality of the feature vector.
Saves both storage and computation overhead.
Incurs only a small penalty on the feature vector distance.
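The dimensionality reduction can be sketched as follows. This is an illustration only: the slides do not name the hash function, so CRC32 is used as a deterministic stand-in, the plain unsigned form of the hashing trick is assumed, and the figure of 100 opcodes in the example is made up.

```python
import zlib


def hashed_vector(ngram_counts, bits=12):
    """Fold n-gram counts into a vector of 2**bits entries via a hash of each n-gram."""
    size = 1 << bits
    vec = [0] * size
    for ngram, count in ngram_counts.items():
        # CRC32 is an assumed stand-in; the source does not specify the hash.
        index = zlib.crc32(" ".join(ngram).encode("utf-8")) % size
        vec[index] += count
    return vec


def full_dimension(num_opcodes, n):
    """Size of the unhashed feature space, D = |O|**N."""
    return num_opcodes ** n


counts = {("push", "mov"): 2, ("mov", "call"): 1}
vec = hashed_vector(counts, bits=12)
print(len(vec), sum(vec))       # 4096 3
print(full_dimension(100, 4))   # 100000000
```

Colliding n-grams simply add into the same slot, which is the source of the small distance penalty mentioned above.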

Prototype-Based Clustering

The process repeats until the distance from every data point to its nearest prototype is smaller than a predefined threshold Pmax.
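The stopping rule can be illustrated with a greedy prototype selection. The slide does not spell out how each new prototype is chosen, so this sketch assumes the point farthest from its nearest prototype is promoted next, starts from the first point, and uses Euclidean distance. `select_prototypes` and `euclidean` are hypothetical names.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_prototypes(points, p_max, dist=euclidean):
    """Add prototypes until every point is closer than p_max to its nearest one."""
    if p_max <= 0:
        raise ValueError("p_max must be positive")
    if not points:
        return [], []
    prototypes = [0]                  # assumption: the first point seeds the set
    nearest = [0] * len(points)       # index of each point's nearest prototype
    gap = [dist(p, points[0]) for p in points]
    while max(gap) >= p_max:
        # Promote the point that is currently farthest from any prototype.
        far = max(range(len(points)), key=gap.__getitem__)
        prototypes.append(far)
        for i, p in enumerate(points):
            d = dist(p, points[far])
            if d < gap[i]:
                gap[i], nearest[i] = d, far
    return prototypes, nearest


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
prototypes, nearest = select_prototypes(points, p_max=1.0)
print(prototypes, nearest)  # [0, 3] [0, 0, 3, 3]
```

Each pass costs one distance computation per point, so the total work grows with the number of prototypes rather than with the square of the number of samples.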

Evaluation

Data sets
Reference data set: 4,821 samples
Large data set: 132,234 samples

System configuration
Core i7 3.0 GHz CPU
12 GB memory

Clustering Accuracy and Running Time

Comparison with existing clustering methods:
MutantX-S: less than 30 s
Hierarchical: 51.3 s (precision 0.82)
k-means: 32.3 s (precision 0.75)
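The slide quotes precision values without defining the metric. The sketch below assumes the definition commonly used in malware clustering work: the fraction of samples whose reference family label matches the majority label of their cluster. The function name and the toy labels are hypothetical and are not data from the evaluation.

```python
from collections import Counter, defaultdict


def clustering_precision(cluster_ids, family_labels):
    """Fraction of samples that carry the majority family label of their cluster."""
    if not cluster_ids:
        raise ValueError("need at least one sample")
    by_cluster = defaultdict(Counter)
    for cid, fam in zip(cluster_ids, family_labels):
        by_cluster[cid][fam] += 1
    majority_total = sum(max(c.values()) for c in by_cluster.values())
    return majority_total / len(cluster_ids)


# Toy example: cluster 0 mixes two families, cluster 1 is pure.
print(clustering_precision([0, 0, 0, 1, 1], ["A", "A", "B", "B", "B"]))  # 0.8
```

Under this definition, placing every sample in its own cluster gives a precision of 1.0, which is why precision is usually reported alongside recall.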

Impact of Hash Size

In practice, a 12-bit hash function is found to be a good compromise, reducing the time and memory requirements by over 80% while still keeping good accuracy.


Thank you!