Feature Engineering & Analysis Towards Temporally Robust Detection of Android Malware
A thesis submitted in partial fulfilment of the requirements
for the degree of Master of Technology
by
Sagar Jaiswal
17111037
under the guidance of
Prof. Sandeep K. Shukla
to the
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KANPUR
July 2019
Abstract
Name of the student: Sagar Jaiswal Roll No: 17111037
Degree for which submitted: M.Tech.
Department: Computer Science & Engineering
Thesis title: Feature Engineering & Analysis Towards Temporally Robust
Detection of Android Malware
Thesis supervisor: Prof. Sandeep K. Shukla
Month and year of thesis submission: July 2019
With the increase in popularity of Android, the number of active users and the day-to-day activity of each user on Android devices have also grown considerably. This has led to malware authors targeting Android devices more and more. It has been reported that 8,400 new instances of malware are found every day, implying that a new malware sample surfaces every 10 seconds! With the growth in the number, variants, diversity and sophistication of malware, conventional methods often fail to detect malicious applications. Moreover, Android allows downloading and installation of applications from unverified sources, so it has become very easy for malware authors to bundle and distribute applications with malware. Fast and accurate detection of Android malware has therefore become a real challenge. These issues call for a more reliable, accurate, generalized and efficient model for Android malware detection.
In this thesis, we build a lightweight malware detection model that is capable of fast, generalized, accurate and efficient detection of Android malware. To fulfill this objective, we have designed a framework that deals with the challenges faced in Android malware detection. We work with more than eight lakh (800,000) samples spread over the period 2010 to 2019. We have created multiple datasets to perform parallel analyses for more robustness. Unlike previous datasets, the datasets we work on are highly class-balanced, and therefore machine learning models are expected to train better on them. We extract several categories of information, such as permissions, APIs, Intents and app components, and analyse their effectiveness towards Android malware detection. Then, with the aim of creating a lightweight detection model, we propose and demonstrate the effectiveness of three feature selection techniques to analyze and identify only the relevant features - those that are the most informative and important for malware detection across samples from different years. Finally, we present the relevant sets of features and a model capable of temporally robust detection of Android malware.
Acknowledgements
I would like to express my sincere gratitude to my thesis advisor Prof. Sandeep K. Shukla
for his constant support and guidance. His inputs throughout the course of the thesis work
were extremely beneficial and constructive.
I would like to acknowledge my fellow graduate students for their feedback. I am also
grateful to Mr Saurabh Kumar for his input and cooperation.
I would also like to thank my family. Without their support and encouragement, this
accomplishment would not have been possible. I would also like to thank my friends for
always being there whenever I needed them.
Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Abbreviations

1 Introduction
  1.1 Problem
  1.2 Motivation
  1.3 Challenges
  1.4 Contribution
  1.5 Thesis Outline

2 Background
  2.1 Android System Architecture
    2.1.1 Linux Kernel
    2.1.2 Hardware Abstraction Layer
    2.1.3 Android Runtime
    2.1.4 Native C/C++ Libraries
    2.1.5 Java API Framework
    2.1.6 Applications
  2.2 Android Applications
    2.2.1 Application Fundamentals
      2.2.1.1 Application Manifest File
      2.2.1.2 Dex File
      2.2.1.3 Permissions
      2.2.1.4 Application Components
      2.2.1.5 Intents
      2.2.1.6 Opcodes
      2.2.1.7 System Commands
  2.3 Android Malware
    2.3.1 Evolution
    2.3.2 Types of Malware
    2.3.3 Malware Detection Approaches
      2.3.3.1 Invasion Detection
      2.3.3.2 Misuse Detection
      2.3.3.3 Anomaly Detection
  2.4 Tools
    2.4.1 Androguard
    2.4.2 Spyder
  2.5 Machine Learning Classifiers
  2.6 Related Work

3 Dataset
  3.1 D/2010-2012
  3.2 D/2013-2015
  3.3 D/2016-2019
  3.4 D/2010-2019
  3.5 D/DREBIN
  3.6 D/AMD

4 Framework Design
  4.1 Framework
  4.2 Framework Modules
    4.2.1 Information Extraction Module
    4.2.2 Feature Engineering & Analysis Module
      4.2.2.1 Feature Construction/Representation
      4.2.2.2 Feature Selection
      4.2.2.3 Analysis
    4.2.3 Learning Detection Model Module

5 Methodology
  5.1 Features
    5.1.1 Feature Construction
      5.1.1.1 Permissions
      5.1.1.2 API
      5.1.1.3 Application Components
      5.1.1.4 Intents
      5.1.1.5 System Commands
      5.1.1.6 Opcodes
      5.1.1.7 Misc Features
    5.1.2 Feature Representation
    5.1.3 Types of Feature Sets
    5.1.4 Feature Size
  5.2 Data Cleaning
  5.3 Frequency of Usage
  5.4 RFECV
  5.5 RFE
  5.6 Learning Final Model
  5.7 Effectiveness of Identified Features

6 Experimentation and Results
  6.1 Experimental Setup
    6.1.1 System
    6.1.2 Machine Learning Classifiers
    6.1.3 Dataset Partition
  6.2 Evaluation Metrics
    6.2.1 Confusion Matrix
    6.2.2 Accuracy
    6.2.3 Precision
    6.2.4 Recall
    6.2.5 F1 Score
  6.3 Results
    6.3.1 Analysis Results
      6.3.1.1 Requested Permissions
      6.3.1.2 Used Permissions
      6.3.1.3 Restricted API
      6.3.1.4 Service
      6.3.1.5 Receiver
      6.3.1.6 Activity
      6.3.1.7 Providers
      6.3.1.8 Intent Filters
      6.3.1.9 Intent Objects
      6.3.1.10 Intent Const
      6.3.1.11 API
      6.3.1.12 API Packages
      6.3.1.13 System Commands
      6.3.1.14 Opcodes
      6.3.1.15 Misc Features
    6.3.2 Final Results
      6.3.2.1 Base Model Results
      6.3.2.2 Individual Category Results
      6.3.2.3 Combined Category Results
      6.3.2.4 Results on Test Set
      6.3.2.5 Effectiveness of Identified Features

7 Conclusion and Future Work
  7.1 Conclusion
  7.2 Future Work

Bibliography
List of Figures

2.1 Android System Architecture [2]
4.1 The Proposed Framework
4.2 Information Extraction Module
4.3 Feature Engineering & Analysis Module
4.4 Feature Selection Process
5.1 Drop in Number of Features in Feature Sets
5.2 Accuracy vs Number of Features plot for API Packages based on frequency of usage
5.3 Accuracy vs Number of Features plot for System Commands based on frequency of usage
5.4 Accuracy vs Number of Features plot for Requested Permissions based on frequency of usage
5.5 Accuracy vs Number of Features plot for Requested Permissions using RFECV
5.6 Accuracy vs Number of Features plot for API Packages using RFECV
5.7 Accuracy vs Number of Features plot for Intent Const using RFECV
5.8 Accuracy vs Number of Features plot for Opcodes using RFECV
List of Tables

2.1 Datasets for previous works
3.1 Datasets Summary
5.1 List of Feature Sets
5.2 Number of Features in each Feature Set
6.1 Partition of Datasets
6.2 Confusion Matrix
6.3 Performance results on evaluation set for requested permissions using initial features
6.4 Performance results on evaluation set for requested permissions using final features
6.5 Performance results on evaluation set for used permissions using initial features
6.6 Performance results on evaluation set for used permissions using final features
6.7 Performance results on evaluation set for restricted API using initial features
6.8 Performance results on evaluation set for restricted API using final features
6.9 Performance results on evaluation set for service using initial features
6.10 Performance results on evaluation set for service using final features
6.11 Performance results on evaluation set for receiver using initial features
6.12 Performance results on evaluation set for receiver using final features
6.13 Performance results on evaluation set for activity using initial features
6.14 Performance results on evaluation set for activity using final features
6.15 Performance results on evaluation set for providers using initial features
6.16 Performance results on evaluation set for providers using final features
6.17 Performance results on evaluation set for Intent Filters using initial features
6.18 Performance results on evaluation set for Intent Filters using final features
6.19 Performance results on evaluation set for Intent Objects using initial features
6.20 Performance results on evaluation set for Intent Objects using final features
6.21 Performance results on evaluation set for Intent Const using initial features
6.22 Performance results on evaluation set for Intent Const using final features
6.23 Performance results on evaluation set for API using initial features
6.24 Performance results on evaluation set for API using final features
6.25 Performance results on evaluation set for API packages using initial features
6.26 Performance results on evaluation set for API packages using final features
6.27 Performance results on evaluation set for system commands using initial features
6.28 Performance results on evaluation set for system commands using final features
6.29 Performance results on evaluation set for opcodes using initial features
6.30 Performance results on evaluation set for opcodes using final features
6.31 Performance results on evaluation set for Misc Features using initial features
6.32 Performance results on evaluation set for Misc Features using final features
6.33 Performance results of relevant categories on evaluation set
6.34 Performance results using combined categories
6.35 Performance results of final model on different test sets
6.36 Performance results to show effectiveness of identified features
Abbreviations

OS Operating System
AV Anti-Virus
DVM Dalvik Virtual Machine
JVM Java Virtual Machine
VM Virtual Machine
ART Android Runtime
ELF Executable and Linkable Format
NDK Native Development Kit
SDK Software Development Kit
APK Android Package
LR Logistic Regression
NN Neural Network
RF Random Forest
ET Extra Tree
Acc Accuracy
Prec Precision
Rec Recall
F1 F1 Score
RP Requested Permission
IC Intent Const
AP API Packages
SC System Commands
O Opcodes
M Miscellaneous
Dedicated to my Parents
Chapter 1
Introduction
In recent years, Android has overtaken many other mobile operating systems to become
one of the most popular mobile platforms in the world. A report recently shared by IDC
on smartphone operating system global market share [13] shows that in the 3rd quarter of
2018, the total market share of Android was 86.8%. Developers prefer Android over
other smartphone operating systems for developing applications because it is completely
open source. Users, on the other hand, are inclined to opt for Android smartphones due
to the availability of low- to high-end models, ease of use, customization, a high level of
multitasking, custom ROMs, support for a large number of applications, etc. In May 2019,
Google revealed that there are now more than two and a half billion monthly active
Android devices [21].
1.1 Problem
With the increase in popularity of Android, the number of active users and the day-to-day
activity of each user on Android devices have also grown considerably. This has led to
malware authors targeting Android devices more and more. It has been reported by
Gadgets360 [33] that 8,400 new instances of malware are found every day. This implies
that a new malware sample surfaces every 10 seconds!
1.2 Motivation
Google highly recommends using trusted and verified sources like the Google Play Store
for the installation of Android applications. The Google Play Store uses Google Play
Protect to provide security features that protect users from malicious applications [2]. It
scans all Android applications before and after installation to ensure no foul play is
happening. However, due to the increasing number and variants of malware, malicious
applications keep getting into the Google Play Store. According to a news report by
TechCrunch [36], a new kind of mobile adware got past the Google Play Store's security
mechanisms and hid in hundreds of Android applications; these infected applications were
downloaded more than 150 million times. Third-party markets can also be used for
downloading and installing Android applications, and Android allows installation from
unverified sources. It has therefore become very easy for malware authors to bundle and
distribute applications with malware.
1.3 Challenges
With the growth in the number, variants, diversity and sophistication of malware, conventional
methods often fail to detect malicious applications. Malware authors often deploy
techniques such as code obfuscation to avoid detection [5]. Such techniques are implemented
to protect malware from reverse engineering. This leads to the transformation of
strings/features into different forms that are less informative and contribute little to
machine-learning-based malware detection systems. As a result of such transformations,
the feature set size also grows, which increases the complexity for machine learning
techniques of training a classifier for efficient malware detection. Fast and accurate
detection of Android malware has therefore become a real challenge. These issues call for
a more reliable, accurate, generalized and efficient model for Android malware detection.
1.4 Contribution
The aim and objective of this work is to face the challenges in the area and build a
lightweight malware detection model that is capable of fast, generalized, accurate and
efficient detection of Android malware. To fulfill this objective, we have designed a
framework that deals with the challenges faced in Android malware detection. The major
contributions of this work towards Android malware analysis and detection are -
• We analyze the effectiveness of different features extracted using static analysis for
detecting Android malware using multiple datasets.
• We present several feature selection techniques that filter the noise present in the data
and output only relevant and informative features.
• We evaluate and demonstrate the effectiveness of our model for Android malware
detection.
1.5 Thesis Outline
Chapter 2 presents an outline of Android by discussing the Android platform architecture,
Android applications and their fundamentals. It then provides an introduction to malware,
its evolution over time, types of malware and malware detection approaches. It also briefly
explains the tools used during this study and ends with a discussion of related work
in this field. Chapter 3 discusses the sources of Android applications - both benign
and malicious. It describes the steps involved in the creation of each dataset along with its
size and properties. Chapter 4 provides an insight into the framework and its different modules.
Chapter 5 discusses the features constructed, their size, properties, trends and patterns. It
also describes the implementation details of the proposed framework. Chapter 6 presents
the experimentation and the results of the analysis. It covers the setup environment
of the system, the dataset split into train, evaluation & test sets, feature engineering and
model performance. Finally, Chapter 7 concludes this study and provides directions for
future work.
Chapter 2
Background
Android was developed by Google. It is an open source mobile OS based on the Linux
kernel and has been released under the Apache v2 open source license. The following
sections provide an overview of the Android system architecture, benign & malicious
applications, the Android tools and techniques used to distinguish them, and previous
research work done in this field.
2.1 Android System Architecture
The Android architecture [2] is a stack of six major components, as shown in Fig. 2.1.
Each component in the stack, along with its corresponding elements, is integrated in a
manner that provides an optimal environment for the development and execution of
applications on Android devices. The following subsections discuss each of these layers
in detail.
2.1.1 Linux Kernel
There are many advantages to using the open source Linux kernel. It serves as the
foundation of the Android platform [19]. The underlying functionalities of the Linux
kernel are -
• Device drivers
• Power management
Figure 2.1: Android System Architecture [2]
• Memory management
• Device management
• Process management
• Security management
• Network stack
2.1.2 Hardware Abstraction Layer
This layer provides a level of abstraction by allowing the device hardware and the software
to communicate. It defines a standard interface that enables the higher-level Java API
frameworks to use hardware capabilities. Several modules are present in the hardware
abstraction layer; they are loaded when a call to access a hardware component is made.
Hardware abstraction layer implementations can be changed without affecting higher
levels [12].
2.1.3 Android Runtime
DVM was the default runtime environment before Android version 5.0 (API level 21).
DVM is similar to the JVM and uses just-in-time compilation: every time the application
is launched, the dex code is translated into machine-dependent code within a virtual
machine.

ART is the runtime environment for Android version 5.0 (API level 21) or higher. It works
on the principle of ahead-of-time compilation. In the ART environment, an APK is
compiled into native code, known as an ELF executable, at the time of installation. The
ELF executable is run every time the application is launched, resulting in better
application performance.
2.1.4 Native C/C++ Libraries
The functionality of native libraries is accessed through the Java framework APIs. The
Android NDK is used to access native platform libraries when developing an application
that requires C and C++ code. With the help of native libraries (written in C/C++),
native code for Android system components and services is built.
2.1.5 Java API Framework
This layer lies on top of the core libraries & Android runtime and provides interfaces for
the development of Android applications. It also provides high-level services to Android
applications using Java classes. Applications interact directly with this framework, which
includes APIs for telephony, location, resources, package management, etc. These APIs
(written in Java) are used to access the feature set provided by the Android OS. Using
the APIs simplifies the process of developing an Android application through reuse of
components and services like -
• Telephony Manager
• Resource Manager
• Activity Manager
• Notifications Manager
• View System
• Content Providers
• Package Manager
• Location Manager
2.1.6 Applications
Applications lie on the topmost layer of the system architecture. There are two types of
applications -

• Default applications - the set of core applications, such as those for SMS or MMS
messaging, contacts, email and internet browsing, that are provided by the device
vendor itself.

• Third-party applications - the set of applications that are developed by a third party.
2.2 Android Applications
These applications are developed to run on the Android operating system. They are
available on Android application markets like the Google Play Store, Amazon Store, etc.
The code for an Android application (written in Java, Kotlin or C++), along with other
artifacts such as data and resource files, is compiled into an APK by the SDK tools
provided by Android. An APK is a type of archive file (like zip, jar or tar) with .apk as
the extension [39] and is based on the JAR file format. Android devices use this format
to install Android applications. An APK contains -

• META-INF directory: contains files like MANIFEST.MF, CERT.RSA, etc.

• lib: contains the platform-dependent compiled code for processors like MIPS, x86,
x86_64, ARM, etc.

• resources.arsc: precompiled resources like binary XML are included in this file;
resources that are not compiled here are placed in the res directory.

• assets: contains the application's assets.

• AndroidManifest.xml: an additional Android manifest file containing information
like version, name, etc.

• classes.dex: the compiled classes in the dex file format. Both the Dalvik virtual
machine and the Android Runtime can understand the dex format.
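Because an APK is just a zip archive, its top-level contents can be inspected with a standard zip reader and no Android tooling. A minimal Python sketch (the helper names are illustrative, not part of the thesis framework):

```python
import zipfile

def list_apk_contents(apk):
    """Return the entry names inside an APK (APKs are zip archives).
    Accepts a file path or a file-like object."""
    with zipfile.ZipFile(apk) as archive:
        return archive.namelist()

def has_dex(apk):
    """Check whether the archive bundles compiled dex code."""
    return any(name.endswith(".dex") for name in list_apk_contents(apk))
```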
2.2.1 Application Fundamentals
This section provides a brief synopsis of Android applications.
2.2.1.1 Application Manifest File
Every application must have an application manifest file - AndroidManifest.xml. This
manifest file includes important information about the application that is needed by the
Android OS, such as the application's package name, version name and version code, the
components of the application, permissions, etc.
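Inside the APK the manifest is stored in Android's binary XML encoding, so it must first be decoded (tools such as Androguard can do this) before it reads as plain text. Assuming an already-decoded manifest, the package name and requested permissions can be pulled out with the Python standard library; the manifest content used below is a hypothetical example:

```python
import xml.etree.ElementTree as ET

# Android's XML namespace, used for the android:name attribute
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def parse_manifest(xml_text):
    """Extract the package name and requested permissions from a
    decoded (plain-text) AndroidManifest.xml string."""
    root = ET.fromstring(xml_text)
    package = root.get("package")
    permissions = [
        elem.get(ANDROID_NS + "name")
        for elem in root.findall("uses-permission")
    ]
    return package, permissions
```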
2.2.1.2 Dex File
In an Android project, all the Java source files are compiled into .class files, which
contain Java bytecode instructions. These are in turn translated into the .dex format
(dex bytecode). Both Android runtime environments (the older DVM and the newer ART)
can understand dex bytecode. The classes and methods in an application are referenced
by the classes.dex file.
2.2.1.3 Permissions
The Android security architecture does not, by default, allow any application to perform
operations such as accessing the user's private data, keeping the device awake, accessing
another application's data, or any other operation that would harm the user, the operating
system or other apps. So, to access sensitive user data or system features, Android apps
must request permission. All such permissions must be declared in the manifest file. There
are four protection levels defined for Android permissions -

• Normal permissions - permissions that can cause very little impact or harm to the
user's privacy when granted. Normal permissions are therefore automatically granted
upon request.

• Dangerous permissions - permissions to sensitive data and resources. The user is
prompted to grant them at runtime; only dangerous permissions require agreement
from the user.

• Signature permissions - granted at install time. The application that attempts to
use the permission and the application that defines the permission must be signed
with the same certificate.

• SignatureOrSystem permissions - these work like signature-level permissions; the
only difference is the certificate used to sign the application. SignatureOrSystem
permissions are tied to the device vendor's certificate and are only granted to
applications signed with it.
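Requested permissions are one of the feature categories analysed later in this thesis. As a simple illustration (one plausible encoding, not necessarily the exact pipeline used in later chapters), an application's permissions can be represented as a binary vector over a fixed permission vocabulary:

```python
def permission_vector(requested, vocabulary):
    """One-hot encode an app's requested permissions over a fixed
    vocabulary: 1 if the permission is requested, else 0."""
    requested = set(requested)
    return [1 if perm in requested else 0 for perm in vocabulary]
```

Each app then becomes a fixed-length row suitable for a machine learning classifier.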
2.2.1.4 Application Components
For an Android application, application components serve as the essential building blocks.
Each component defines an entry point through which the user or the system can enter
the application. There exist four types of application components -

• Activity - defines an entry point for the user. It provides an interface with a single
screen for interacting with the user, e.g. in an email app there can be several
activities: one to show the list of new emails, another to read them, etc.

• Services - a general-purpose entry point to the application, used for running
background operations. A service can also provide functionality to other applications.
It does not provide any user interface, e.g. music being played by a service in the
background while the user works in a different application.

• Broadcast receivers - another entry point into the application, enabling it to respond
to system-wide broadcast announcements even when the application is not running,
e.g. an alarm for an upcoming event or a low-battery announcement.

• Content providers - manage application data. With the help of a content provider,
one application can share its data with another application securely; for example,
an application can read contact details via the contacts content provider.
2.2.1.5 Intents
An intent describes an operation or a message that facilitates inter-process and intra-process
communication. Intents are used to activate activities, services, and broadcast receivers;
at runtime, they bind individual components to each other.
2.2.1.6 Opcodes
Opcodes refer to Dalvik instructions in their raw form, similar to assembly instructions.
These instructions are mainly designed to be interpreted by the Android runtime
environment (DVM/ART).
2.2.1.7 System Commands
System commands refer to shell commands (e.g. su, uname, etc.) executed inside an
application. To execute a command, Android calls the exec() method of the Runtime
class, which creates a subprocess.
2.3 Android Malware
With the growing popularity of Android, attackers and malware authors have increasingly
focused on targeting the platform. Malware authors are taking malware creation to new
levels: malware is now more sophisticated and impactful, and is commonly disguised as
benign, useful applications.
2.3.1 Evolution
This section shows the evolution of malware in mobile platforms [37].
• Timofonica - the first known mobile virus. Its discovery dates back to June 2000.
• In June 2004, an anti-piracy Trojan hack was made by Ojam. Later in July,
a proof-of-concept virus named ‘Cabir’ targeting the Symbian OS was released.
The ‘Cabir’ virus spread via Bluetooth [40].
• In March 2005, Commwarrior-A, a worm replicating through MMS and targeting
the Symbian OS, was reported. Commwarrior was built on the basic concepts of
Cabir and was the first malware to exploit its victims financially.
• In 2006, RedBrowser - the first Trojan capable of infecting multiple mobile
platforms - was discovered. It was built by further extending Commwarrior's
functionality.
• In August 2010, AndroidOS.FakePlayer.a, a Trojan that exploited SMS services,
was discovered by Kaspersky Lab. AndroidOS.FakePlayer.a was the first SMS
malware to target the Android operating system.
• In 2011, DroidDream, a Trojan able to send sensitive data to remote servers and
install other applications without the user's consent, was discovered in the Google
Play Store. DroidDream infected more than 50 applications, causing damage to
hundreds of thousands of victims.
• In 2012, Boxer, a Trojan with behaviour similar to Commwarrior, was discovered.
Boxer used mobile country and network codes to distribute itself via text messages
in more than 60 countries.
• In 2013, FakeDefender, the first ransomware targeting Android was discovered.
2.3.2 Types of Malware
There are several types of malware like -
• Ransomware - Ransomware encrypts data, making it inaccessible and unavailable to
the user.
• Spyware - Spyware spies on Android devices and steals valuable information such as
passwords.
• Adware - Adware is a type of malware that shows continuous popups of advertise-
ments.
• Trojan - A Trojan enters the user's device disguised as a program that the user
has willingly downloaded.
• Virus - A virus is a type of malware which when executed infects other files by
inserting malicious code.
• Worm - Worms are malware that keep replicating themselves to infect other
devices. Worms do not need any user interaction to execute.
• Expander - Expander harms the user by inflating billing amounts.
• Backdoor - Backdoor is a piece of malicious code that allows unauthorized access to
the infected devices.
2.3.3 Malware Detection Approaches
Malware detection approaches can be categorised into three categories.
2.3.3.1 Invasion Detection
This method detects any attack or unauthorized access.
2.3.3.2 Misuse Detection
This method detects misuse by matching samples against stored signatures of known
malware. Advantages:
• Very good detection of known malware
• No false positives
Disadvantages:
• This method cannot detect even minimal variants of known malware
2.3.3.3 Anomaly Detection
This method detects malware based on behaviour: it learns patterns from the given
samples and tries to identify abnormal behaviour.
Advantages:
• This method can efficiently detect unseen and unknown malware.
Disadvantages:
• A large set of unique samples is needed to identify malicious behaviour.
• High false positive rate
There are two types of Anomaly detection techniques -
• Static Analysis - In static analysis, an application/program is analysed without
executing it. This technique can be applied to both the source code and the binary
file (in our case, the APK) to find security threats or malicious functionality
in an application.
• Dynamic Analysis - In dynamic analysis, the behaviour of an application is analysed
during its execution in an isolated environment. This method can capture malicious
behaviour that is not revealed by static analysis.
2.4 Tools
2.4.1 Androguard
Androguard [7] is a Python-based tool that is widely used for reverse engineering
Android applications and performing static analysis.
2.4.2 Spyder
Spyder [30] is a Python-based environment designed for editing, analysing and debugging
Python code.
2.5 Machine Learning Classifiers
Classification is the process of predicting a class or category of a sample. In the case of
binary classification, there are only two categories. Classifiers are the algorithms that do
the classification [38]. There are several classifiers like -
• Logistic Regression [20]
• Random Forest [25]
• Neural Network [24]
• Extra Tree [23]
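As a minimal sketch (on randomly generated data, not the thesis's datasets), the four classifiers listed above can be trained and compared on a binary classification task with scikit-learn; all parameter values here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for an (apps x features) matrix with binary labels.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                                    random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
}

# Fit each classifier and record its accuracy on the held-out split.
scores = {name: clf.fit(X_train, y_train).score(X_test, y_test)
          for name, clf in classifiers.items()}
```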
2.6 Related Work
• Wang et al. [34] extract used permissions and API calls from applications to train
a model using the AdaBoostM1 classifier. Evaluating their model on 1170 malware
samples and 1205 benign samples, they claim a true positive rate of 99.6%.
• Feizollah et al. [8] show the effectiveness of explicit and implicit Intents for An-
droid malware detection. The evaluation has been done on 5560 malware samples
and 1846 benign samples. They claim 91% detection accuracy (TPR) using Intents.
• Sun et al. [32] propose a permission-based model trained using SVM. They claim an
accuracy of 93.62% for malware in the dataset and 91.4% for unknown malware.
• Firdaus et al. [10] focus on system commands for the identification of root-malware
through static analysis. The evaluation has been done by selecting 550 malware
out of the 1260 malware available in the Malgenome dataset and 550 benign samples
downloaded from the Google Play Store. They claim an accuracy of 92.5%.
• Gaviria et al. [11] perform analysis on 1259 benign samples collected from the Google
Play Store and 1259 malware samples collected from the Android Genome Project,
targeting opcodes as features and using several machine learning classifiers. They
claim a precision of 96.758%.
• Using static analysis, DREBIN [3] gathered 8 types of features from 1,23,453 be-
nign samples and 5,660 malware samples and trained a model using SVM. DREBIN
claims to have outperformed several related approaches with a detection rate of 94%.
The type of features extracted by DREBIN are -
– Hardware Components
– Requested Permissions
– App Components
– Filtered Intents
– Used Permissions
– Suspicious API Calls
– Restricted API Calls
– Network Addresses
• ANASTASIA [9] extracts 6 types of features using static analysis. Then, with the
aim of building a promising model, the authors use several machine learning
classifiers such as AdaBoost, Random Forest, k-NN, Logistic Regression, etc. They
also use feature importances obtained from an Extra Trees classifier to discard
features below a given threshold value. Evaluating the performance of their framework
on 18,677 malware samples and 11,187 benign samples, they obtain a true positive
rate of 97.3% and claim results better than the state-of-the-art detection
methods for Android malware. The types of features extracted by ANASTASIA are -
– Intents
– Used Permissions
– System Commands
– Suspicious API Calls
– Malicious Activities
• Li et al. [16] propose a machine learning approach based on Factorization Machines.
They evaluate the model's performance on two malware datasets - DREBIN
and AMD [35].
– DREBIN : They collected 5560 malware samples from the DREBIN dataset
and 5600 benign samples from the Internet and extracted 7 types of features
from them, resulting in 93,324 features. They claim a precision of 99.91%.
– AMD : They also collected 24,553 malware samples from the AMD dataset
and 16,753 benign samples from the Internet and extracted 2,94,019 features.
They obtained a precision of 99.35% for this dataset.
They extract 7 types of features -
– Restricted APIs
– Suspicious APIs
– Used Permissions
– App Components
– Hardware Features
– Permissions
– Intent Filter
Summary: In the previous works we observe that the datasets are very small and un-
balanced. This skewness results in machine learning approaches producing poorly fitted
models and inflated values of the evaluation metrics on these datasets.
Also, we observe that the samples gathered are from a very narrow time window as shown
in Table 2.1. Therefore, the machine learning models trained on such datasets are unable
to generalize and sustain over time. This is because similar malware tend to appear
together in time. Also, the Android versions being targeted, the environment (DVM or
ART), the hardware specific components being targeted (like Bluetooth was more popular
in earlier years, WiFi is more popular now), and the mechanisms used (like SMS, emails)
vary greatly over time. In such a scenario, the machine learning models that are trained
on such limited datasets work with highly specific features and tend to over-fit for that
particular dataset (or time).
Therefore, to counter this, we work with multiple datasets that are spread out over a
period of January 2010 to February 2019. Unlike previous datasets, these are highly class
balanced and therefore it is expected that machine learning models should train better on
them. We have analyzed the instances occurring over time and have identified only the
relevant features - those that are the most informative and important for malware
detection across different years.
Dataset   Source                  Malware   Benign     Total      Malware Collection
1         Genome                  1170      1205       2375       August 2010 to October 2011
2         Drebin                  5560      1846       7506       August 2010 to October 2012
3         Drebin                  5494      5494       10,988     August 2010 to October 2012
4         Genome                  550       550        1100       August 2010 to October 2011
5         Genome                  1259      1259       2,518      August 2010 to October 2011
6         Drebin                  5560      1,23,453   1,29,013   August 2010 to October 2012
7         Genome + Drebin +
          M0Droid + VirusTotal    18,677    11,187     29,864     2009 - 2015
8         DREBIN                  5660      5660       11,320     August 2010 - October 2012
          AMD                     24,553    16,753     41,306     2010 - 2016
Table 2.1: Datasets for previous works
Chapter 3
Dataset
AndroZoo [1] is a growing repository that provides Android applications - both benign
and malware - for Android malware analysis. It contains more than 9 million applications
that have been analysed by multiple AV engines for labeling as malware or benign. For
our work, the samples that have been recognised as benign by all the AV engines are
considered benign. As for malware, the samples that have been recognised as malware by
at least 10 different AV engines are considered malware.
For an Android application, the dex date represents the application's creation date, i.e.
the last modification date of classes.dex (the last date when the app's code was built).
The dex date is stored in the dex file inside the APK archive. AndroZoo provides the
SHA256 of each application along with its dex date. We use the SHA256 to identify unique
samples and the dex date to categorise them by year.
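A minimal sketch of this labelling and year-wise categorisation; the sample tuples, truncated hashes, detection counts, and bucket names below are hypothetical stand-ins for AndroZoo metadata, not its actual schema:

```python
from datetime import datetime

# Hypothetical AndroZoo-style metadata rows: (sha256, dex_date, AV detections).
samples = [
    ("a1b2...", "2011-06-12 00:00:00", 0),   # flagged by no AV engine  -> benign
    ("c3d4...", "2014-03-05 00:00:00", 14),  # flagged by >= 10 engines -> malware
    ("e5f6...", "2017-11-20 00:00:00", 3),   # 1-9 detections -> left unused
]

def label(detections):
    """Apply the labelling rule: 0 detections -> benign, >= 10 -> malware."""
    if detections == 0:
        return "benign"
    if detections >= 10:
        return "malware"
    return None  # ambiguous samples are not used

def year_bucket(dex_date):
    """Map a dex date onto one of the three year-range datasets."""
    year = datetime.strptime(dex_date, "%Y-%m-%d %H:%M:%S").year
    if 2010 <= year <= 2012:
        return "D/2010-2012"
    if 2013 <= year <= 2015:
        return "D/2013-2015"
    return "D/2016-2019"

datasets = {}
for sha256, dex_date, detections in samples:
    cls = label(detections)
    if cls is not None:
        datasets.setdefault(year_bucket(dex_date), []).append((sha256, cls))
```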
In this work, the available set denotes the applications available in the AndroZoo
repository. We have collected these applications from the repository and categorised
them by dex date into sets of samples for both classes - malware and benign - to create
multiple datasets.
Other than AndroZoo, Android malware samples have also been gathered from DREBIN
and AMD. Each dataset, along with its source and characteristics, is explained in detail
in the following sections.
3.1 D/2010-2012
D/2010-2012 is a highly class-balanced dataset with a large number of samples from
both classes - malware and benign - for analysis. It contains a total of 3,87,236 samples
evenly distributed among the classes, with 1,93,612 malware samples and 1,93,624 benign
samples. Both the malware and benign samples have been collected by randomly selecting
them from the available set of applications in the AndroZoo repository with dex dates
between January 2010 and December 2012.
3.2 D/2013-2015
D/2013-2015 is also a highly class-balanced dataset, with a sample size of 3,70,294. It
contains 1,85,181 malware applications and 1,85,113 benign applications. All the samples
have dex dates between January 2013 and December 2015 and have been gathered via
random selection from the available set.
3.3 D/2016-2019
D/2016-2019 is again a highly class-balanced dataset containing more recent samples. It
contains 66,901 samples, of which 33,460 are benign and 33,441 are malware. The samples
from both classes have dex dates between January 2016 and February 2019. All the
malware applications present in the available set have been used, whereas the benign
applications have been randomly selected from the available set.
3.4 D/2010-2019
We combine all the samples from the above three datasets to create a final dataset
containing 8,24,431 samples, of which 4,12,234 are malware and 4,12,197 are benign. We
use this dataset to train and test our final model.
3.5 D/DREBIN
D/DREBIN is a dataset created by collecting a total of 11,048 samples. It contains
5,479 malware samples from the DREBIN dataset - a popular dataset for malware
analysis. The DREBIN dataset contains malware from 179 different families, collected
between August 2010 and October 2012. For the benign samples, we have used the
AndroZoo repository, gathering 5,569 samples by randomly selecting them from the
available benign samples. The dex dates of all the benign samples lie within the same
range as the malware samples.
3.6 D/AMD
D/AMD contains a total of 48,749 samples, of which 24,489 are benign and 24,260 are
malware. The malware samples have been collected from the AMD dataset - another
well-accepted and popular dataset among researchers for Android malware analysis,
containing more recent samples than the DREBIN dataset. The malware samples in the
AMD dataset were collected between 2010 and 2016. The benign set contains samples
randomly selected from the available set, all with dex dates between 2010 and 2016.
Table 3.1 shows the summary of the datasets.
Dataset #Malware #Benign #Total Year-Range
D/2010-2012 1,93,612 1,93,624 3,87,236 2010 - 2012
D/2013-2015 1,85,181 1,85,113 3,70,294 2013 - 2015
D/2016-2019 33,441 33,460 66,901 2016 - 2019
D/2010-2019 4,12,234 4,12,197 8,24,431 2010 - 2019
D/DREBIN 5,479 5,569 11,048 2010 - 2012
D/AMD 24,260 24,489 48,749 2010 - 2016
Table 3.1: Datasets Summary
Summary: A model built upon samples from a narrow time frame identifies features that
are important for detecting malware from a similar time frame. Similarly, a model trained
on a small set of samples can only detect similar malware. However, malware keeps
evolving over time, and models built by analysing small sample sets or samples from a
particular time frame will not identify such evolved and sophisticated malware. Therefore,
a large set containing samples from across the years is needed to build a sustainable
malware detection model.
We work with a large number of samples spread out over the period from 2010 to
February 2019. The size and time range of this set allow the analysis to cover many
variants of malware that would otherwise have been left out. Learning an estimator and
evaluating its performance on all known variants of malware is a necessary step towards
a generalised model that performs and sustains well.
Many features used in Android malware detection are time and size bound. Our analysis
shows that certain categories of features perform well on samples from a particular time
frame but poorly on samples from a different time frame. Similarly, some categories show
good performance on small sample sets compared to large ones, and vice versa. Therefore,
identifying categories that show good detection results on malware from different time
frames (different years) and on different sample set sizes is an important step towards
Android malware detection.
To address the above-mentioned issues, we have categorised the set of samples into three
categories based on the dex date. We analyse the performance of each category using
several machine learning classifiers on several datasets and different evaluation metrics.
These datasets are large and highly class balanced, and therefore machine learning models
are expected to train better on them.
Chapter 4
Framework Design
We propose a framework for extraction, analysis and refinement of raw data to output the
most relevant information to learn a model that is capable of efficient detection of malware
in the Android environment. The following sections discuss the proposed framework and
its modules in detail.
4.1 Framework
The proposed framework has three modules as shown in Fig 4.1. Each module is a separate
process and can be used independently.
• Information Extraction Module - The information extraction module is responsible for
unpacking/decompiling an Android application and extracting raw data from the
APK. The extracted information is then stored in the JSON file format. The information
extraction module is used by passing two arguments - the path to the directory
containing samples and the path to the directory to store the extracted data.
Section 4.2.1 discusses this module in more detail.
• Feature Engineering & Analysis Module - This module first constructs meaningful
features from the extracted data. Then it implements various feature selection
techniques and performs analysis using several machine learning classifiers. The
Feature Engineering & Analysis module is used by passing two arguments - the path
to the directory containing the JSON files and the path to store the final set of
features and the analysis results. Section 4.2.2 discusses the Feature Engineering
& Analysis module in depth.
• Detection Model - This module takes the path to a set of samples as its argument
and learns a final malware detection model based on those samples for fast, accurate
and efficient malware detection. Section 4.2.3 discusses this module in more detail.
Figure 4.1: The Proposed Framework - three modules (Information Extraction, Feature
Engineering & Analysis, Learning Model) that output the final model
The proposed framework has the following advantages over existing state-of-the-art
malware detection approaches.
• Light Weight - This framework is built on a static analysis approach integrated with
highly effective feature engineering techniques, resulting in a very lightweight model
capable of efficiently detecting malware.
• Generalized and Sustainable - Several large and highly class-balanced datasets have
been used for the analysis. These datasets contain malware and benign samples from
throughout the years, including samples as recent as 2019. This wide range covers
the varieties of polymorphic, sophisticated and evolving malware that have been
discovered over the years.
• Better Detection Performance - Several feature selection techniques have been
implemented that reduce noise in the data which would otherwise have led to poor
model performance.
• Scalable and Efficient - The proposed framework implements static analysis tech-
niques with the help of a reverse engineering tool - Androguard - which, unlike
dynamic analysis techniques, takes relatively little time. Therefore the analysis
can be performed on a large number of samples, which is important for facing the
rapidly growing number of malware samples.
• Novel Set of Features - The feature set used in this work is effective in the detection
of Android malware. We demonstrate the detection results of each category of
features using several classifiers in section 6.3.1.
• Less Computation Time and Memory Usage - The proposed framework implements
several highly effective feature selection techniques that output only the relevant
and important set of features. Therefore, for analysing an Android application, only
these features need to be extracted and stored, which saves a lot of time and
memory.
4.2 Framework Modules
The following subsections describe each of the modules in depth.
4.2.1 Information extraction module
Among other things, a typical Android application contains application code, resources,
assets, certificates, and a manifest file, all packaged into the APK. Using reverse
engineering tools like Androguard, apkparser, etc., an abundant amount of information
can be extracted and analysed from these files. The main features for this work are
extracted from the manifest file and the dex code. The proposed framework extracts
several categories of information from these two files, as shown in Fig 4.2.
Figure 4.2: Information Extraction Module
The first argument to the Information Extraction module is the path to the set of samples.
Each sample in the dataset is loaded and passed to Androguard for reverse engineering.
Using Androguard, the Android application is first unpacked and decompiled, and then
the manifest file and the dex code are retrieved to extract information like permissions,
APIs, opcodes, system commands, etc. from each sample. The extracted information is
dumped into the directory provided as the second argument while running this module.
Each JSON file contains a dictionary of key-value pairs, where each key is the type of
information extracted (permission, API, activity, intent, etc.) and the value is all the
relevant information extracted in that category.
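The storage format described above can be sketched with Python's standard json module; the category keys, values, and file name here are illustrative, not the framework's exact output:

```python
import json
import os
import tempfile

# Illustrative extracted data for one APK: each key is an information
# category; each value is the list of items extracted in that category.
extracted = {
    "permission": ["android.permission.INTERNET", "android.permission.SEND_SMS"],
    "api": ["Landroid/telephony/SmsManager;->sendTextMessage"],
    "activity": ["com.example.MainActivity"],
    "intent": ["android.intent.action.BOOT_COMPLETED"],
}

# One JSON file per sample, named here (hypothetically) after its hash.
with tempfile.TemporaryDirectory() as out_dir:
    path = os.path.join(out_dir, "sample_sha256.json")
    with open(path, "w") as f:
        json.dump(extracted, f, indent=2)
    # The feature engineering module would later reload the dictionary.
    with open(path) as f:
        loaded = json.load(f)
```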
4.2.2 Feature Engineering & Analysis Module
This module performs an in-depth analysis of the feature sets as shown in Fig 4.3. The
following subsections discuss each process in this module in detail.
Figure 4.3: Feature Engineering & Analysis Module
4.2.2.1 Feature Construction/Representation
The data extracted from the applications is stored in files as dictionaries whose values
are lists of strings. It needs to be converted into a format that represents meaningful
information and that machine learning classifiers can consume.
Due to the high dimensionality of the features, we first represent the data in sparse
matrix format. Then, after the data cleaning step of the feature selection process, we
represent it in dense matrix format. Section 5.1 discusses the details of the feature
representation.
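A small SciPy sketch of the two representations; the toy matrix is illustrative:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy binary feature matrix: 3 apps x 5 features, mostly zeros -- the
# typical shape after one-hot encoding the extracted strings.
rows = np.array([0, 0, 1, 2])   # app indices of the non-zero entries
cols = np.array([0, 3, 1, 4])   # feature indices of the non-zero entries
vals = np.ones(4)
X_sparse = csr_matrix((vals, (rows, cols)), shape=(3, 5))

# After data cleaning prunes most columns, a dense array becomes practical.
X_dense = X_sparse.toarray()
```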
4.2.2.2 Feature Selection
We often have thousands of features, and machine learning is challenged by such large
feature sets. Many of these features are irrelevant or redundant, which increases the
complexity of the model. Therefore, we need to consider only the most important and
relevant features. The benefits of building a model with such features are -
• It reduces the computational cost and time for training a model.
• It reduces the complexity of a model and makes it easier to interpret.
• It reduces the variance of the model and therefore over-fitting.
• It improves the accuracy of a model if the proper features are chosen.
Feature selection methods are often used to solve such problems [15]. Fig 4.4 shows the
feature engineering module for this work. This module has five phases, listed below and
discussed in subsequent sections.
• Data Cleaning
• Feature Elimination Based on Frequency of Usage
• Identification of Saturation Point using RFECV
• Extracting Optimal Set of Features using RFE
• Feature Selection Based on Correlation between Features
Figure 4.4: Feature Selection Process
Data Cleaning
In the data cleaning step, we search for a final threshold value (by varying c_init) using
Equation 4.1, such that there is a significant drop in the number of rarely used features.
We discard those features if they do not contribute to any improvement in detection
performance. Section 5.2 discusses the implementation details of the data cleaning step.
threshold_val = c_init (4.1)
Feature Elimination Based on Frequency of Usage
This step is an extension of the data cleaning step. For each feature in a feature set, we
look at its frequency of usage by malware as well as benign applications (how many times
the feature has been used by malware and how many times by benign) and construct
multiple c values. These c values serve as variables in finding the threshold values for
malware and benign applications. Features whose frequency of usage is below the
threshold in both malware and benign, and whose difference in frequency of usage
between malware and benign is less than c/2, are discarded. For every c, we follow this
step to reduce features and evaluate the model performance. The feature set at the
saturation point in the results, after which there is no significant improvement in
performance, is taken as the reduced set of features. The implementation details for this
process are explained in Sec 5.3.
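A simplified sketch of this elimination rule on toy binary matrices; the exact way the thresholds are derived from c is an assumption for illustration:

```python
import numpy as np

def frequency_filter(X_mal, X_ben, c):
    """Return indices of features that survive the frequency rule.

    X_mal, X_ben: binary matrices (samples x features), one per class.
    A feature is discarded if its usage count falls below a c-derived
    threshold in BOTH classes AND the malware/benign counts differ by
    less than c/2 (i.e. it is rare and non-discriminative).
    """
    freq_mal = X_mal.sum(axis=0)
    freq_ben = X_ben.sum(axis=0)
    thr_mal = c  # illustrative: thresholds derived directly from c
    thr_ben = c
    rare = (freq_mal < thr_mal) & (freq_ben < thr_ben)
    similar = np.abs(freq_mal - freq_ben) < c / 2
    keep = ~(rare & similar)
    return np.where(keep)[0]

# Toy data: feature 0 is common in malware; feature 1 is rare and
# equally used by both classes, so it gets discarded.
X_mal = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 0]])
X_ben = np.array([[0, 0, 0], [0, 1, 1], [0, 0, 0]])
kept = frequency_filter(X_mal, X_ben, c=2)
```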
RFECV - Recursive Feature Elimination with Cross Validation
RFECV [27], as the name suggests, is the process of recursively eliminating features using
feature ranking, where the ranking is based on feature importance, followed by a
cross-validated selection of the best number of features. We use Extra Trees as the base
estimator for RFECV. RFECV takes a classifier C, a scoring function F and the number
of features k to eliminate in every step as input parameters. It starts with all the
features and computes the performance score and the feature ranking based on
importance. It then eliminates the k lowest-ranked features and recomputes the
predictions, the performance score and the feature ranking. The iterations stop when
there is no significant improvement in the results. An accuracy vs. number-of-features
graph is generated, and the number of features to use is chosen from this graph as the
point beyond which there is no significant improvement in results.
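A scikit-learn sketch of RFECV with an Extra Trees base estimator on synthetic data; the parameter values are illustrative, not the thesis's settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for the (apps x features) matrix.
X, y = make_classification(n_samples=200, n_features=25, n_informative=5,
                           random_state=0)

selector = RFECV(
    estimator=ExtraTreesClassifier(n_estimators=25, random_state=0),
    step=1,              # k features eliminated per iteration
    cv=3,                # cross-validated selection of the best feature count
    scoring="accuracy",
)
selector.fit(X, y)

# The saturation point read off the cross-validated accuracy curve.
best_n = selector.n_features_
```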
RFE - Recursive Feature Elimination
RFE [26] is a process for choosing the best set of features in the data. It reduces
features by recursively considering smaller sets of features each time until the desired
number of features is reached. Starting with the initial feature set, an estimator is
trained on these features to calculate the importance of each feature. Then, depending
on the step parameter (say step = n), the n least important features are pruned from the
current set. This process is repeated on the pruned set until the reduced set is the
same size as the desired number of features. Using RFE, we reduce the feature set to
an optimal subset of features based on the result gathered from RFECV.
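A scikit-learn sketch of RFE on synthetic data; the target of 8 features stands in for the count suggested by RFECV and is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=200, n_features=25, n_informative=5,
                           random_state=0)

# Prune the 2 least important features per step until 8 remain.
rfe = RFE(estimator=ExtraTreesClassifier(n_estimators=25, random_state=0),
          n_features_to_select=8, step=2)
rfe.fit(X, y)
X_reduced = rfe.transform(X)   # matrix restricted to the selected features
```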
Feature Reduction Based on Correlation between Features
We remove features that are dependent on each other using a correlation matrix.
Correlation measures the linear relationship between two variables [22]. Two variables
that are linearly dependent have a higher correlation than two variables that are
non-linearly dependent, so features with high correlation are more linearly dependent.
Such features have almost the same effect on the dependent variable; therefore, when
two features are highly correlated, we can drop one of them.
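A small pandas sketch of this correlation-based pruning; the 0.95 cutoff and the feature names are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "f1": a,
    "f2": a * 2 + rng.normal(scale=0.01, size=200),  # near-duplicate of f1
    "f3": rng.normal(size=200),                      # independent feature
})

corr = df.corr().abs()
# Scan the upper triangle so each pair is inspected once, and drop one
# feature from each highly correlated pair.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
```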
4.2.2.3 Analysis
This step draws general inferences from the proposed techniques and results. Starting
with the initial set of features, at each step of feature reduction we have analysed the
model performance by varying several factors such as evaluation metrics, learning
parameters, threshold values, etc. For a few steps, we have also analysed the results
using 5-fold cross-validation to generalise the results and avoid over-fitting. All the
analysis has been done on multiple datasets. Section 6.3.1 shows the trends and results
of the analysis of several feature sets.
4.2.3 Learning Detection Model Module
The first two modules of the framework analyse the extracted data and the proposed
techniques for better detection, and output the results. Based on the analysis and
results from the first two modules, this module identifies the important categories of
features, learns several models based on combinations of different categories (like API
Package + Requested Permissions, or Opcodes + System Commands + Requested
Permissions), and analyses the performance of each of those models. Finally, a
lightweight detection model is learned, based on only the relevant and meaningful
information extracted from the applications. We demonstrate the performance of the
final model in section 6.3.2 by testing it on multiple test sets. We also demonstrate the
effectiveness of the identified relevant features by learning two models, on the D/AMD
and D/DREBIN datasets respectively, based on these relevant features, and present their
performance using several evaluation metrics.
Summary: This chapter deals with challenges in Android malware detection - such as
the curse of dimensionality, noise, and skewness - by proposing a framework capable of
efficient, fast, accurate and generalised malware detection. The framework has several
modules and steps/processes within those modules. It performs analysis and outputs
results using multiple datasets to give a robust model.
Chapter 5
Methodology
This chapter provides an insight into the features constructed for this work, the way they
have been represented, their size and properties. It also shows trends in features related to
frequency of usage, their importance with respect to other features and their contribution
in detection of malware.
5.1 Features
The information extracted from the applications is in the form of lists and dictionaries.
We need to convert it into meaningful features and represent them appropriately for use
with machine learning techniques. This section discusses the feature construction
process, the representation, types and size of the features, and the types of values they
hold.
5.1.1 Feature Construction
In this study, we use the term feature construction to describe the conversion of raw
data into meaningful features that can be used with machine learning techniques to train
models. The following subsections describe the types of information that have been
extracted for use as features.
5.1.1.1 Permissions
The use of certain combinations or settings of permissions often reflects malicious
behaviour. Therefore, we have extracted two sets of permissions to identify harmful
behaviour.
• Requested Permissions – All permissions (Android-defined and third-party) requested
by an application must be declared in the manifest file. We have retrieved all these
requested permissions for use as features.
• Used Permissions – When an application requests some resource, the package manager
checks whether the required permission has been granted. All such permissions
declared in the manifest file are used permissions. The list of used permissions is
taken as features.
Along with the individual permissions, we have also taken the total number of requested
permissions, the number of AOSP permissions, the number of third-party permissions and the
total number of permissions used by an application as features.
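The count-based permission features above can be sketched as follows. This is a minimal illustration, not the thesis's extraction code; the permission lists are made-up examples, and AOSP permissions are identified simply by the `android.permission.` prefix.

```python
# Sketch: derive the count-based permission features described above.
# The permission lists below are illustrative, not taken from a real APK.

AOSP_PREFIX = "android.permission."

def permission_counts(requested, used):
    """Return the four count features built from an app's permissions."""
    n_aosp = sum(1 for p in requested if p.startswith(AOSP_PREFIX))
    return {
        "num_requested": len(requested),
        "num_aosp": n_aosp,
        "num_third_party": len(requested) - n_aosp,
        "num_used": len(used),
    }

requested = ["android.permission.INTERNET",
             "android.permission.SEND_SMS",
             "com.example.perm.CUSTOM"]        # a third-party permission
used = ["android.permission.INTERNET"]
print(permission_counts(requested, used))
```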
5.1.1.2 API
The use of certain packages, classes and methods by an application can indicate suspicious
behaviour. Therefore we have also considered APIs as features. There are three types of
API related features -
• API : For each method in a class belonging to an Android defined package, we have
built a string to represent its use. This list of strings has been taken as features [18]
[17].
• API Package : These represent the Android defined packages used by the application.
• Restricted API calls : These represent API calls for which the required permission
has not been requested. Use of such API calls generally implies some malicious
behaviour.
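The restricted API check above can be sketched as a set difference between the permissions an API call requires and the permissions the app actually requested. The API-to-permission mapping below is a toy stand-in (in practice such a mapping comes from resources like PScout or axplorer); the method names are illustrative.

```python
# Sketch: flag "restricted" API calls, i.e. calls whose required permission
# was never requested in the manifest. The mapping here is a toy example.

API_PERMISSION_MAP = {
    "Landroid/telephony/SmsManager;->sendTextMessage":
        "android.permission.SEND_SMS",
    "Landroid/location/LocationManager;->getLastKnownLocation":
        "android.permission.ACCESS_FINE_LOCATION",
}

def restricted_calls(api_calls, requested_permissions):
    """Return the API calls whose required permission was not requested."""
    requested = set(requested_permissions)
    return [call for call in api_calls
            if call in API_PERMISSION_MAP
            and API_PERMISSION_MAP[call] not in requested]

calls = list(API_PERMISSION_MAP)
print(restricted_calls(calls, ["android.permission.SEND_SMS"]))
```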
5.1.1.3 Application Components
Each application component defines either a user interface or an interface to the system.
We have considered all of them as features.
• Activity
• Service
• Content Providers
• Broadcast Receivers
The total numbers of activities, services, providers and receivers have also been taken as features.
5.1.1.4 Intents
Malware often listens for intents. Therefore we can also use intents to identify malicious
behaviour. Three sets of intents have been taken as features.
• Intent Filter : The intents present in the manifest file.
• Intent Const [31] : The categories of the intents extracted from the dex file.
• Intent Objects : All the messaging objects in the dex file through which actions
from another application component are requested.
We have also taken the total counts of intent filters, intent consts and intent objects
in an application as features.
5.1.1.5 System Commands
When an attacker gains root privileges on the system, they can execute several commands
that cause harm. Therefore patterns in the usage of system commands can also help to
identify malicious behaviour [29].
5.1.1.6 Opcodes
Identifying patterns in the usage of opcodes, extracted from the classes.dex file, can
also help in Android malware detection.
5.1.1.7 Misc Features
The presence of native code, dynamic code loading, reflection, crypto code, the total number
of calls in the recording category, the camera category etc. can also help in identifying
malicious behaviour.
5.1.2 Feature Representation
For each category, we have built a binary feature matrix from the collected information,
where 1 indicates that the application contains a particular item and 0 indicates
otherwise. We have represented the feature matrix in sparse format using CountVectorizer.
For features in the API, API Packages, System Commands and Opcodes categories, we take the
frequency of use of the feature by an application as the value in the reduced feature-set
matrix. The initial feature matrix has been represented in row-sparse format to save
memory. For the reduced set of features obtained after the feature selection step, a dense
format has been used.
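The binary sparse representation described above can be sketched with scikit-learn's CountVectorizer: each application becomes a space-separated string of the items it contains, and `binary=True` yields a 0/1 row-sparse (CSR) matrix. The feature names are illustrative.

```python
# Sketch of the binary sparse representation using CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

apps = [
    "SEND_SMS INTERNET READ_CONTACTS",   # app 1: three items
    "INTERNET",                          # app 2: one item
]
# analyzer=str.split keeps each item whole instead of tokenizing on "_" etc.
vectorizer = CountVectorizer(analyzer=str.split, binary=True)
X = vectorizer.fit_transform(apps)       # scipy CSR (row-sparse) matrix

print(sorted(vectorizer.vocabulary_))    # column names in column order
print(X.toarray())                       # dense view of the 0/1 matrix
```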
5.1.3 Types of Feature sets
There are two types of feature sets.
• Categorical Features : Categories that contain lists of strings as values.
• Miscellaneous Features : Categories that contain numerical values, usually counts.
Table 5.1 shows the list of feature sets and their type.
Categorical Features:
Requested Permissions, Used Permissions, API, API Packages, Restricted API,
Intent Filters, Intent Objects, Intent Const, System Commands, Opcodes,
activity, service, receiver, providers

Miscellaneous Features:
num activity, num service, num receiver, num providers, num intent-filters,
num intent objects, num intent consts, Num Aosp Permissions, Num Third party
Permissions, Num Requested Permissions, count binary category, count dynamic
category, count crypto category, count network category, count gps category,
count os category, count io category, count recording category, count telephony
category, count accounts category, count bluetooth category, count nfc category,
count display category, count content, count context, count database, count
reflection category, count sms category, entropy rate, ascii obfuscation,
target sdk, eff target sdk, min sdk, max sdk, reflection code, native code,
dynamic code, APK is signed

Table 5.1: List of Feature Sets
5.1.4 Feature size
Each feature set contains different number of features varying from several dozens to sev-
eral lac’s. There are many features that are specific to some applications. Often it results
in a large vector space. For features like API packages, system commands etc, we have
used Android defined names which leads to smaller vector space.
Table 5.2 shows the total number of features in each feature set for three datasets.
Feature set/Dataset D/2010-2012 D/2013-2015 D/2016-2019
Activity 12,53,259 13,61,756 4,15,243
Service 88,575 1,25,578 41,994
Broadcast Receiver 90,081 1,14,134 33,115
Content Provider 9952 10,265 3833
Intent Filters 82,281 85,948 34,200
Intent Objects 1,19,570 1,30,259 33,461
Intent Const 14,409 17,226 4816
Requested Permissions 15,162 39,501 14,748
Used Permissions 60 60 55
Restricted API 1874 1740 398
API 38,747 52,245 68,113
API packages 179 212 232
System Commands 181 193 175
Opcodes 222 222 224
Misc 42 42 42
Total 17,14,594 19,39,381 6,50,649
Table 5.2: Number of Features in each Feature set
A model built by gathering as much information as possible helps the classifier learn
more. However, there are several disadvantages to using a large number of features to
build such models.
• A large amount of resources is needed to deal with a large number of features.
• The processing time to analyse these features increases.
• With an increase in the number of features, the computation power needed to build a
model also increases.
Such models are not feasible, scalable or efficient. Therefore we need a mechanism by
which a classifier can be trained using fewer features while maintaining similar
detection results or improving them.
5.2 Data Cleaning
In Android malware analysis, we often encounter two types of strings - Android defined
and custom/third-party defined. Android defines a list of permissions in the AOSP (Android
Open Source Project). However, any app developer can define their own set of permissions
to include in an application. Similarly, an activity or a service can be given any name
depending on the developer's choice. This results in a feature being exclusive to very few
Android applications, or sometimes even one application. These custom defined strings may
or may not be informative for the detection of Android malware. If, for a feature, the
value in most of the samples is zero, the feature is rarely used or not used at all by the
applications. Such features contribute little information compared to features that are
used a lot. Likewise, when a variable has a constant value across all samples in the
dataset, it does not improve the power of the model because it has little variance.
Therefore, if a feature is used rarely, or very few times compared to the number of
samples present, it can be dropped in favour of others. We therefore propose a data
cleaning step that eliminates a feature based on how many times it is used in the dataset.
This filtering helps to reduce noise in the dataset as well as discard less informative
features.
Keeping 1 as the minimum threshold means we keep a feature if it is used by at least one
sample; taking 2 as the threshold means a particular feature must be used by at least 2
samples in the whole dataset, otherwise we discard it. Similarly, a threshold of 10 means
a feature should be used by at least 10 samples.
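The cleaning rule above can be sketched on a plain 0/1 matrix: count how many samples use each feature (column) and keep only the columns that meet the threshold. The matrix below is a toy example; the real pipeline operates on the sparse matrices described earlier.

```python
# Sketch of the cleaning step: drop every feature (column) that is used by
# fewer than c_init samples.

def clean_features(matrix, c_init):
    """Return (kept column indices, reduced matrix)."""
    n_cols = len(matrix[0])
    usage = [sum(row[j] for row in matrix) for j in range(n_cols)]
    keep = [j for j in range(n_cols) if usage[j] >= c_init]
    return keep, [[row[j] for j in keep] for row in matrix]

X = [[1, 0, 1, 0],
     [1, 0, 0, 0],
     [1, 0, 1, 0]]
kept, X_clean = clean_features(X, 2)
print(kept)        # columns used by at least 2 samples
print(X_clean)
```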
To find the final threshold value, we increase the c_init parameter value by one at every
step and look at the resulting number of features. The c_init value after which there is
no significant drop in the number of features is taken as the final threshold value. We
also evaluate the model's performance at every step after discarding features, to ensure
similar detection results between the model trained on the reduced set of features and the
model trained on the initial set of features.
There is a huge drop in the number of features for categories like activity, service,
receiver, provider, intent filter, intent object, intent const, requested permissions and
API, whereas used permissions, restricted API, Android API packages, system commands and
opcodes do not show any significant drop in the number of features. Fig 5.1 shows the
trend of the drop for some categories.
Figure 5.1: Drop in Number of Features in Feature sets
5.3 Frequency of Usage
This step also relies on the usage count of a feature for the elimination process. For
each categorical feature set, we separate the samples according to their labels, malware
and benign, into two sets and find the corresponding threshold values (threshold_ben and
threshold_mal). We calculate threshold_ben using equation 5.1 and threshold_mal using
equation 5.2. Given a feature, if it is used less than threshold_mal times in the malware
set and less than threshold_ben times in the benign set, and the difference in its
frequency of usage between the two sets is less than half of the current c value chosen,
we drop it. We argue that such a feature is less likely to provide useful information to
distinguish between malware and benign samples compared to other features that are used
frequently. For example, suppose we have 1000 malware and 1000 benign samples. If a
feature is used once by malware and twice by benign, it can be considered for elimination,
since it does not help classification. However, to avoid losing meaningful features, such
as a feature used once in the benign set and 999 times in the malware set, we also look at
the difference in the frequency of usage.
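The elimination rule just described can be sketched as a small predicate. The thresholds follow eqs. 5.1 and 5.2 (threshold = number of samples / c); the numbers in the example mirror the 1000-malware/1000-benign scenario above, with c = 100 chosen arbitrarily for illustration.

```python
# Sketch of the frequency-based elimination rule: drop a feature only when
# it is rare in BOTH classes AND its malware/benign usage counts are close.

def should_drop(count_mal, count_ben, n_mal, n_ben, c):
    threshold_mal = n_mal / c            # eq. 5.2
    threshold_ben = n_ben / c            # eq. 5.1
    rare_in_both = count_mal < threshold_mal and count_ben < threshold_ben
    similar_usage = abs(count_mal - count_ben) < c / 2
    return rare_in_both and similar_usage

# 1000 malware / 1000 benign samples, c = 100 -> thresholds of 10 uses each.
print(should_drop(1, 2, 1000, 1000, c=100))     # rare and similar: drop
print(should_drop(999, 1, 1000, 1000, c=100))   # large difference: keep
```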
To construct candidate values for the c parameter, we create a list from the frequencies
of usage of all the features in the malware set and in the benign set. We then filter out
values for which there is no drop in the number of features compared to the previous c
value, to create a final list. These values are used to evaluate the model's performance,
from which an accuracy vs number of features graph is plotted. The c value where the
improvement saturates is chosen as the required c value. This c value is different for
each categorical set, which is one of the benefits of having categorical feature sets.
During c-parameter tuning, we have used the Extra Trees classifier as the base estimator
to evaluate model performance. Figs 5.2, 5.3 and 5.4 show the accuracy vs number of
features graphs for the API Packages, System Commands and Requested Permissions categories
respectively.
threshold_ben = #BenignSamples / c        (5.1)

threshold_mal = #MalwareSamples / c       (5.2)
Figure 5.2: Accuracy vs Number of Features plot for API Packages based on frequency of usage. Panels: (a) D/2010-2012, (b) D/2013-2015, (c) D/2016-2019.
Figure 5.3: Accuracy vs Number of Features plot for System Commands based on frequency of usage. Panels: (a) D/2010-2012, (b) D/2013-2015, (c) D/2016-2019.
Figure 5.4: Accuracy vs Number of Features plot for Requested Permissions based on frequency of usage. Panels: (a) D/2010-2012, (b) D/2013-2015, (c) D/2016-2019.
5.4 RFECV
Using accuracy as the scoring function, we plot a graph of accuracy vs number of features
using RFECV. We have set the step parameter to 1, which means RFECV discards one feature
at a time and evaluates the model's performance; the feature discarded at each step is the
one with the least importance relative to the others. The following call shows how the
RFECV function has been used.
• selector = RFECV(estimator, step=1, cv=StratifiedKFold(5), scoring="accuracy")
Here, the Extra Trees classifier with 20 estimators and a random state of 84 has been used
as the base estimator.
• estimator = ExtraTreesClassifier(n_estimators=20, random_state=84, n_jobs=-1)
We have also used several attributes of RFECV to help us with the analysis.
• selector.support_
• selector.grid_scores_
• selector.estimator_
After a thorough analysis, we identify a saturation point for accuracy beyond which there
is no significant improvement in detection results. Figs 5.5, 5.6, 5.7 and 5.8 show the
accuracy vs number of features graphs for Requested Permissions, API Packages, Intent
Const and Opcodes respectively.
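The RFECV setup above can be sketched end-to-end on a small synthetic dataset. The dataset (and its sizes) is a placeholder standing in for the real feature matrices; the estimator and RFECV parameters mirror those listed above.

```python
# Sketch of the RFECV setup with the thesis's parameters, run on a small
# synthetic dataset in place of the real feature matrices.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=200, n_features=12, n_informative=4,
                           random_state=84)
estimator = ExtraTreesClassifier(n_estimators=20, random_state=84, n_jobs=-1)
selector = RFECV(estimator, step=1, cv=StratifiedKFold(5), scoring="accuracy")
selector.fit(X, y)

print(selector.n_features_)      # optimal number of features found
print(selector.support_.sum())   # size of the selected-feature mask
```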
5.5 RFE
After identifying the optimal number of features from the RFECV plot, we use RFE to reduce
the initial set of features to the desired number of features (say x). The RFE function
has been used with the following parameters.
• selector = RFE(estimator, n_features_to_select=x, step=1)
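The RFE step can be sketched in the same way; here x = 5 is an arbitrary example value standing in for the number read off the RFECV plot, and the synthetic dataset again replaces the real feature matrices.

```python
# Sketch of the RFE step: reduce the feature set to the x features chosen
# from the RFECV plot (x = 5 here is an arbitrary example value).
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=200, n_features=12, n_informative=4,
                           random_state=84)
estimator = ExtraTreesClassifier(n_estimators=20, random_state=84, n_jobs=-1)
x = 5
selector = RFE(estimator, n_features_to_select=x, step=1)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)
```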
Figure 5.5: Accuracy vs Number of Features plot for Requested Permissions using RFECV (y-axis: cross-validation score, i.e. fraction of correct classifications). Panels: (a) D/2010-2012, (b) D/2013-2015, (c) D/2016-2019.
Figure 5.6: Accuracy vs Number of Features plot for API Packages using RFECV (y-axis: cross-validation score). Panels: (a) D/2010-2012, (b) D/2013-2015, (c) D/2016-2019.
Figure 5.7: Accuracy vs Number of Features plot for Intent Const using RFECV (y-axis: cross-validation score). Panels: (a) D/2010-2012, (b) D/2013-2015, (c) D/2016-2019.
Figure 5.8: Accuracy vs Number of Features plot for Opcodes using RFECV (y-axis: cross-validation score). Panels: (a) D/2010-2012, (b) D/2013-2015, (c) D/2016-2019.
5.6 Learning Final Model
We have combined the train and evaluation splits of the D/2010-2012, D/2013-2015 and
D/2016-2019 datasets and trained the final model, which detects Android malware over time.
It is to be noted that at no point in the process did we look at the test split of any
dataset. This final model is trained using the 297 identified features that are critical
and relevant over the years. The performance of this model is shown in section 6.3.2. We
report the performance of our model using different metrics, such as accuracy, precision,
recall and the confusion matrix, to demonstrate the model's robustness.
5.7 Effectiveness of Identified Features
We have analyzed the instances occurring over time and identified the relevant features,
i.e. the features that are the most informative and important for malware detection across
different years. These features can be used with any set of samples to train a model for
proper classification. To show the effectiveness of the identified features, we train two
models, using the train sets of the D/DREBIN and D/AMD datasets, based on the identified
set of features, and evaluate the performance of the models on the unseen test sets of
the D/DREBIN and D/AMD datasets respectively.
Summary: This chapter provides the implementation details for the various modules and the
processes and steps within the modules. It also explains the reasoning behind the
performed operations and provides the parameters used for the various functions.
Chapter 6
Experimentation and Results
This chapter discusses the experimental setup, the analysis results and the final model results.
6.1 Experimental setup
This section discusses the system configuration, the parameters of the machine learning
classifiers and the dataset distribution.
6.1.1 System
The dataset download and code development have been done on a machine with the
configuration Ubuntu 16.04, 16 GB RAM, 7 TB storage and 8 cores. All other processes have
been run on a system with the configuration Ubuntu 16.04.6, 32 GB RAM, 1 TB HDD + 128 GB
SSD and 24 cores.
6.1.2 Machine Learning Classifiers
The following classifiers, with the given parameters, have been used for training the
models during the analysis phase. Based on the performance of each classifier, shown in
section 6.3.1, we choose the Extra Trees classifier as the best one for further in-depth
analysis and for training the final model.
• LogisticRegression : LogisticRegression(random_state=84)
• RandomForest : RandomForestClassifier(n_estimators=20, random_state=84)
• NeuralNetwork : neural_network.MLPClassifier(hidden_layer_sizes=3)
• ExtraTrees : ExtraTreesClassifier(n_estimators=20, random_state=84, n_jobs=-1)
All the parameters for these classifiers were chosen arbitrarily, without looking at the
evaluation results and without parameter tuning. However, to train the final model, we
tune the n_estimators parameter of the Extra Trees classifier.
6.1.3 Dataset Partition
To analyse the proposed framework and to measure model performance, we have split the
datasets into 3 parts - a train set, an evaluation set and a test set.
• Train set - The set of samples used to train/fit the model.
• Evaluation set - The set of samples on which frequent evaluation of the trained model
has been performed to improve model performance.
• Test set - The set of samples used to obtain an unbiased measure of the final model's
performance.
The D/2010-2012, D/2013-2015 and D/2016-2019 datasets have been split in a 0.60 : 0.20 :
0.20 ratio to obtain the train set, the evaluation set and the test set respectively. We
have used the random.sample() function from the random package in Python to obtain the
samples for the test set (20% of the samples). Then, the train_test_split function (from
scikit-learn) with 20 as the random state has been used to split the remaining samples in
a 0.75 : 0.25 ratio into the train and evaluation sets.
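The two-step 60/20/20 partition can be sketched as follows; the integer "samples" and the seed values are placeholders for the real APK sample lists (a 20% holdout followed by a 0.75/0.25 split of the remainder yields the 60/20/20 ratio).

```python
# Sketch of the 60/20/20 partition: hold out 20% as the test set first,
# then split the remainder 0.75/0.25 into train and evaluation sets.
import random
from sklearn.model_selection import train_test_split

samples = list(range(1000))                       # stand-in for APK samples
rng = random.Random(20)
test = rng.sample(samples, k=len(samples) // 5)   # 20% test set
test_ids = set(test)
rest = [s for s in samples if s not in test_ids]
train, evaluation = train_test_split(rest, test_size=0.25, random_state=20)

print(len(train), len(evaluation), len(test))
```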
We have combined the train and evaluation sets of the D/2010-2012, D/2013-2015 and
D/2016-2019 datasets to create a set for D/2010-2019. This set has been split 0.75 : 0.25
to obtain the train and evaluation sets respectively, using the same train_test_split
function. The test sets of the D/2010-2012, D/2013-2015 and D/2016-2019 datasets have
been used as the test sets for the D/2010-2019 dataset.
The D/AMD dataset has been split in a 0.75 : 0.25 ratio, where 75% of the samples have
been used as the train set and the remaining 25% as the test set. The D/AMD dataset does
not have an evaluation set.
The D/DREBIN dataset has also been split in a 0.75 : 0.25 ratio, where 75% of the samples
have been used as the train set and the remaining 25% as the test set.
Table 6.1 shows the summary of the partition.
Dataset        Train      Evaluation   Total (Train+Eval), Mal + Ben    Test, Mal + Ben
D/2010-2012    2,32,342   77,448       3,09,790 (1,54,890 + 1,54,900)   77,446 (38,722 + 38,724)
D/2013-2015    2,22,177   74,059       2,96,236 (1,48,145 + 1,48,091)   74,058 (37,036 + 37,022)
D/2016-2019    40,141     13,381       53,522 (26,753 + 26,769)         13,379 (6,688 + 6,691)
D/2010-2019    4,94,660   1,64,888     6,59,548 (3,29,788 + 3,29,760)   1,64,883 (82,446 + 82,437)
D/AMD          36,561     -            36,561 (18,256 + 17,695)         12,188 (6,004 + 6,184)
D/DREBIN       8,286      -            8,286 (4,111 + 4,175)            2,762 (1,368 + 1,394)

Table 6.1: Partition of Datasets
6.2 Evaluation Metrics
To evaluate model performance during the analysis phase and to demonstrate the final
model's performance, we have used several popular evaluation metrics that are commonly
used in machine learning. The following subsections provide an overview of the metrics
used for measuring model performance [14], described in the context of our work.
6.2.1 Confusion Matrix
The confusion matrix shows the performance of an estimator on the evaluation/test set, as
shown in Table 6.2.
• True Positives (TP) : In this study, TP represents the correctly classified malware
samples.
• True Negative (TN) : TN represents the correctly classified benign samples.
• False Positive (FP) : FP represents those benign samples that have been classified
as malware.
• False Negative (FN) : FN represents those malware samples that have been classified
as benign.
                    Predicted Class
Actual Class        Class = No    Class = Yes
Class = No          TN            FP
Class = Yes         FN            TP
Table 6.2: Confusion Matrix
6.2.2 Accuracy
Accuracy is the ratio of correctly classified samples (benign and malware) to the total
number of samples. It is calculated using eq. 6.1.

Accuracy = (TP + TN) / (TP + FP + TN + FN)        (6.1)
6.2.3 Precision
Precision is the ratio of correctly classified malware to the total number of samples
classified as malware. It is calculated using eq. 6.2.

Precision = TP / (TP + FP)        (6.2)
6.2.4 Recall
Recall is the ratio of correctly classified malware to the total number of malware
samples. It is calculated using eq. 6.3.

Recall = TP / (TP + FN)        (6.3)
6.2.5 F1 Score
The F1 score is the harmonic mean of precision and recall, indicating how precise and
robust the model is. It is calculated using eq. 6.4.

F1 Score = 2 / (1/Precision + 1/Recall)        (6.4)
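The four metrics above can be computed directly from the confusion-matrix counts. The counts in the example are made-up numbers chosen only to exercise the formulas.

```python
# The four evaluation metrics, computed from confusion-matrix counts
# (eqs. 6.1-6.4).
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 / (1 / precision + 1 / recall)   # harmonic mean of prec. and rec.
    return accuracy, precision, recall, f1

# Example counts: 40 malware caught, 10 missed, 45 benign correct, 5 flagged.
acc, prec, rec, f1 = metrics(tp=40, tn=45, fp=5, fn=10)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
```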
6.3 Results
6.3.1 Analysis Results
This section discusses the results of the analysis. For all the tables,
• I - the total number of features in the initial set for a category.
• F - the total number of features in the final set for a category.
The analysis has been performed on the following categories -
• Requested Permissions, Used Permissions
• Restricted API, API, API Packages
• Service, Receiver, Activity, Providers
• Intent Filters, Intent Objects, Intent Const
• System Commands
• Opcodes
• Misc Features
6.3.1.1 Requested Permissions
The features in the requested permissions category show good detection results for both
the initial and final sets of features on all datasets. The feature set sizes for the
three datasets have been reduced by factors of 261, 658 and 247 respectively while still
maintaining similar detection results. Therefore we consider this category for training
our final model. Table 6.3 shows the evaluation results and feature size (total number of
features) for the initial set of features.
Dataset
Classifiers
D/2010-2012 D/2013-2015 D/2016-2019
I: 15,162 I: 39,501 I: 14,748
Acc Prec Rec F1 Acc Prec Rec F1 Acc Prec Rec F1
LR 88.96 88.04 90.31 89.16 89.64 88.83 90.79 89.8 82.06 87.07 75.23 80.72
RF 93.78 92.21 95.71 93.93 94.39 94.47 94.35 94.41 88.54 94.41 81.9 87.71
NN 90.51 89.85 91.46 90.65 90.87 91.77 89.9 90.82 82.76 90.89 72.75 80.81
ET 93.79 92.35 95.57 93.94 94.42 94.7 94.17 94.43 88.7 94.84 81.81 87.84
Table 6.3: Performance results on the evaluation set for requested permissions using initial features
Table 6.4 shows the evaluation results and feature size (total number of features) for the
final set of features.
Dataset
Classifiers
D/2010-2012 D/2013-2015 D/2016-2019
F: 58 F: 60 F: 59
Acc Prec Rec F1 Acc Prec Rec F1 Acc Prec Rec F1
LR 88.32 87.34 89.78 88.55 88.98 87.99 90.4 89.18 81.17 85.53 74.96 79.9
RF 93.59 91.96 95.6 93.75 94.17 94.2 94.2 94.2 88.33 94.18 81.67 87.48
NN 89.02 87.74 90.86 89.27 90.46 89.73 91.48 90.6 85.17 91.05 77.96 83.99
ET 93.56 92.16 95.29 93.7 94.2 94.48 93.95 94.21 88.33 94.36 81.49 87.46
Table 6.4: Performance results on the evaluation set for requested permissions using final features
6.3.1.2 Used Permissions
The used permissions category has a small set of features in both the initial and final
feature sets. The feature set size has been reduced to less than half while maintaining
similar detection results. Although the used permissions category shows decent results,
the performance with both the initial and final sets of features is not good enough. The
category shows good accuracy and precision for all the datasets, but the results for the
other evaluation metrics are relatively poor, and there is a large inconsistency between
the different evaluation metrics. Table 6.5 shows the performance results for the used
permissions category using the initial set of features.
Dataset
Classifiers
D/2010-2012 D/2013-2015 D/2016-2019
I: 60 I: 60 I: 55
Acc Prec Rec F1 Acc Prec Rec F1 Acc Prec Rec F1
LR 85.81 83.6 89.3 86.36 78.91 81.71 74.76 78.08 81.26 85.08 75.74 80.14
RF 87.71 86.78 89.14 87.95 82.48 87.11 76.44 81.43 84.19 90.78 76.04 82.76
NN 86.24 84.63 88.77 86.65 79.76 84.64 72.96 78.37 81.37 85.2 75.85 80.25
ET 87.74 86.94 89.0 87.96 82.52 87.27 76.36 81.45 84.13 90.81 75.89 82.68
Table 6.5: Performance results on the evaluation set for used permissions using initial features
Table 6.6 shows the performance results for the used permissions category using the final set of features.
Dataset
Classifiers
D/2010-2012 D/2013-2015 D/2016-2019
F: 19 F: 32 F: 24
Acc Prec Rec F1 Acc Prec Rec F1 Acc Prec Rec F1
LR 85.63 83.36 89.22 86.19 78.85 82.22 73.9 77.84 81.22 85.02 75.71 80.1
RF 87.46 86.95 88.33 87.63 82.57 87.23 76.52 81.52 84.13 90.79 75.89 82.68
NN 85.95 82.14 92.06 86.82 80.12 84.45 74.08 78.92 82.87 87.86 76.21 81.62
ET 87.52 86.63 88.9 87.75 82.46 87.54 75.91 81.3 84.1 90.8 75.83 82.64
Table 6.6: Performance results on evaluation set for used permissions using final features
6.3.1.3 Restricted API
The initial feature set has been reduced by a factor of at least 5. The restricted API
category shows relatively better detection results for recent samples than for older
samples. The recall values for D/2013-2015 and D/2016-2019 are also the best among all
the metrics. However, the overall detection results are not good enough. We do not
consider this feature set because of its inconsistent results across the datasets and the
evaluation metrics. Table 6.7 shows the performance results for the restricted API
category using the initial set of features.
Dataset
Classifiers
D/2010-2012 D/2013-2015 D/2016-2019
I: 1874 I: 1740 I: 398
Acc Prec Rec F1 Acc Prec Rec F1 Acc Prec Rec F1
LR 74.34 77.37 69.23 73.08 77.33 71.98 89.89 79.94 83.6 81.0 87.71 84.22
RF 79.82 83.17 75.07 78.91 82.79 76.64 94.59 84.67 89.63 87.49 92.44 89.89
NN 76.51 79.91 71.2 75.3 80.23 74.68 91.77 82.35 86.76 85.35 88.71 87.0
ET 79.8 83.18 75.0 78.88 82.76 76.7 94.37 84.62 89.73 87.66 92.44 89.99
Table 6.7: Performance results on evaluation set for restricted API using initial features
Table 6.8 shows the performance results for the restricted API category using the final
set of features.
Dataset
Classifiers
D/2010-2012 D/2013-2015 D/2016-2019
F: 66 F: 60 F: 70
Acc Prec Rec F1 Acc Prec Rec F1 Acc Prec Rec F1
LR 73.05 75.99 67.86 71.7 77.33 71.98 89.89 79.94 82.39 79.8 86.66 83.09
RF 79.45 82.93 74.47 78.47 82.79 76.64 94.59 84.67 89.43 87.11 92.51 89.73
NN 75.75 78.12 71.93 74.9 80.23 74.68 91.77 82.35 86.1 82.34 91.85 86.83
ET 79.45 82.97 74.41 78.46 82.76 76.7 94.37 84.62 89.44 87.18 92.44 89.73
Table 6.8: Performance results on evaluation set for restricted API using final features
6.3.1.4 Service
The initial feature size of the service category is very large compared to the categories
above. The feature set sizes have been reduced by factors of 89, 138 and 44 respectively,
bringing them below 1000 features. This category shows good precision for samples after
2012 but performs poorly on earlier samples. The results across the different evaluation
metrics are also inconsistent. Therefore we do not consider this set. Table 6.9 shows the
performance results for the service category using the initial set of features.
Dataset
Classifiers
D/2010-2012 D/2013-2015 D/2016-2019
I: 88,575 I: 1,25,578 I: 41,994
Acc Prec Rec F1 Acc Prec Rec F1 Acc Prec Rec F1
LR 78.1 98.04 57.6 72.57 85.05 97.87 71.82 82.84 82.32 96.83 66.76 79.03
RF 78.63 97.92 58.75 73.44 86.0 98.27 73.44 84.06 83.1 96.66 68.51 80.19
NN 78.7 97.82 58.96 73.58 86.3 97.95 74.29 84.49 83.17 96.28 68.94 80.35
ET 78.65 97.96 58.76 73.46 86.11 98.36 73.59 84.19 83.19 96.73 68.63 80.29
Table 6.9: Performance results on evaluation set for service using initial features
Table 6.10 shows the performance results for the service category using the final set of features.
Dataset
Classifiers
D/2010-2012 D/2013-2015 D/2016-2019
F: 993 F: 907 F: 938
Acc Prec Rec F1 Acc Prec Rec F1 Acc Prec Rec F1
LR 76.35 98.22 53.95 69.65 79.86 97.77 61.32 75.37 79.41 96.27 61.11 74.76
RF 76.37 98.2 54.0 69.68 79.95 97.68 61.56 75.52 79.69 96.5 61.53 75.15
NN 76.36 98.2 53.99 69.67 79.89 97.46 61.59 75.48 79.54 96.27 61.38 74.96
ET 76.37 98.2 54.0 69.68 79.94 97.63 61.59 75.53 79.65 96.56 61.41 75.08
Table 6.10: Performance results on evaluation set for service using final features
6.3.1.5 Receiver
The initial feature size of the receiver category is also very large. The feature set
sizes have been reduced by factors of 95, 123 and 36, bringing them below 1000 features.
Similar to the service category, this category shows good precision. It shows decent
detection results for samples between 2013 and 2015 but performs poorly for samples before
2013 and after 2015. Therefore we do not consider this set for the final model. Table
6.11 shows the performance results for the receiver category using the initial set of
features.
Classifier   D/2010-2012 (I: 90,081)   D/2013-2015 (I: 1,14,134)   D/2016-2019 (I: 33,115)
             Acc  Prec  Rec  F1        Acc  Prec  Rec  F1          Acc  Prec  Rec  F1
LR 72.3 98.19 45.77 62.43 86.05 98.46 73.38 84.09 79.75 96.93 61.38 75.16
RF 72.31 97.83 45.96 62.54 86.97 98.79 74.98 85.25 79.96 97.1 61.68 75.44
NN 72.48 98.1 46.16 62.78 87.22 98.36 75.84 85.64 79.83 96.93 61.53 75.28
ET 72.3 97.87 45.91 62.5 87.15 98.8 75.34 85.49 79.9 97.21 61.49 75.33
Table 6.11: Performance results on evaluation set for receiver using initial features
Table 6.12 shows the performance results for receiver category using final set of features.
Classifier   D/2010-2012 (F: 947)   D/2013-2015 (F: 927)   D/2016-2019 (F: 912)
             Acc  Prec  Rec  F1     Acc  Prec  Rec  F1     Acc  Prec  Rec  F1
LR 70.31 98.46 41.61 58.5 80.9 98.17 63.18 76.88 75.91 96.48 53.7 68.99
RF 70.35 98.54 41.65 58.56 80.96 98.21 63.26 76.96 76.04 96.52 53.94 69.2
NN 70.35 98.54 41.65 58.55 80.91 98.15 63.21 76.9 75.96 96.56 53.74 69.05
ET 70.35 98.56 41.65 58.55 80.97 98.33 63.21 76.95 76.0 96.54 53.85 69.13
Table 6.12: Performance results on evaluation set for receiver using final features
6.3.1.6 Activity
The initial feature set of the activity category is very large for all the datasets. It has
been reduced by factors of 1313, 1406 and 457, bringing each set below 1,000 features.
This category shows good detection results with the initial set of features. However, even
with a final set of nearly 1,000 features, the detection results drop significantly; maintaining
comparable results between the initial and final sets would require a much larger final set.
Therefore we do not consider this feature set, as the required final set would be too large.
Table 6.13 shows the detection results for the activity category using the initial set
of features.
Classifier   D/2010-2012 (I: 12,53,259)   D/2013-2015 (I: 13,61,756)   D/2016-2019 (I: 4,15,243)
             Acc  Prec  Rec  F1           Acc  Prec  Rec  F1           Acc  Prec  Rec  F1
LR 91.37 97.29 85.21 90.85 92.95 98.19 87.51 92.51 86.85 96.21 76.68 85.34
RF 91.52 96.86 85.92 91.06 93.07 97.94 88.05 92.73 86.25 96.29 75.35 84.54
NN 90.69 97.44 83.68 90.04 93.81 98.40 89.12 93.53 87.54 96.2 78.12 86.22
ET 91.61 96.87 86.11 91.17 93.44 98.03 88.73 93.15 86.91 95.82 77.15 85.47
Table 6.13: Performance results on evaluation set for activity using initial features
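The thesis's actual selection pipeline uses the feature selection techniques introduced in earlier chapters; as a purely illustrative sketch of how a sparse binary feature set of this size can be cut down, one simple approach is to drop features that occur in almost no samples. The function name and threshold below are assumptions for illustration only:

```python
def reduce_binary_features(rows, min_count=2):
    """Keep only the columns (features) that are set in at least
    `min_count` samples; returns kept column indices and reduced rows.

    `rows` is a list of equal-length 0/1 lists, one per application.
    """
    n_cols = len(rows[0])
    counts = [sum(row[j] for row in rows) for j in range(n_cols)]
    kept = [j for j in range(n_cols) if counts[j] >= min_count]
    reduced = [[row[j] for j in kept] for row in rows]
    return kept, reduced

# Illustrative: column 1 appears in only one sample and column 3 in none,
# so both are dropped.
rows = [[1, 0, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 1, 0]]
kept, reduced = reduce_binary_features(rows, min_count=2)
print(kept)  # -> [0, 2]
```

Such frequency filtering alone does not produce the thousand-fold reductions reported here; those come from the supervised selection steps described earlier in the thesis.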
Table 6.14 shows the detection results for activity using final set of features.
Classifier   D/2010-2012 (F: 954)   D/2013-2015 (F: 968)   D/2016-2019 (F: 908)
             Acc  Prec  Rec  F1     Acc  Prec  Rec  F1     Acc  Prec  Rec  F1
LR 88.67 97.08 79.87 87.64 85.28 96.15 73.66 83.41 80.14 95.37 63.27 76.07
RF 88.9 96.85 80.55 87.95 85.67 96.11 74.5 83.94 80.88 96.08 64.3 77.04
NN 88.82 96.99 80.26 87.84 85.41 96.12 73.96 83.6 80.2 95.48 63.33 76.15
ET 88.89 96.9 80.49 87.94 85.66 96.17 74.44 83.92 80.87 96.12 64.26 77.02
Table 6.14: Performance results on evaluation set for activity using final features
6.3.1.7 Providers
The feature set has been reduced to roughly a tenth of its original size. This category
does not show good detection results at all, so we do not consider it for our model.
Table 6.15 shows the performance results for the providers category using the initial set
of features.
Classifier   D/2010-2012 (I: 9952)   D/2013-2015 (I: 10,265)   D/2016-2019 (I: 3833)
             Acc  Prec  Rec  F1      Acc  Prec  Rec  F1        Acc  Prec  Rec  F1
LR 51.01 87.09 3.03 5.86 53.95 97.24 8.6 15.8 60.42 55.97 97.02 70.99
RF 51.61 50.96 99.82 67.48 53.94 97.03 8.61 15.82 60.33 55.91 97.02 70.94
NN 51.62 50.97 99.82 67.48 53.95 97.21 8.61 15.83 60.43 55.98 97.01 70.99
ET 51.61 50.97 99.81 67.48 53.95 97.26 8.6 15.8 60.33 55.91 97.02 70.94
Table 6.15: Performance results on evaluation set for providers using initial features
Table 6.16 shows the performance results for providers category using final set of features.
Classifier   D/2010-2012 (F: 444)   D/2013-2015 (F: 418)   D/2016-2019 (F: 443)
             Acc  Prec  Rec  F1     Acc  Prec  Rec  F1     Acc  Prec  Rec  F1
LR 50.82 86.58 2.6 5.05 50.91 85.4 2.66 5.16 60.12 55.78 96.95 70.81
RF 50.82 86.54 2.61 5.06 50.93 84.77 2.73 5.28 60.07 55.75 96.95 70.79
NN 50.82 86.58 2.6 5.05 50.92 85.24 2.69 5.21 60.13 55.79 96.95 70.82
ET 50.82 86.58 2.6 5.05 50.92 85.02 2.7 5.23 60.09 55.76 96.92 70.79
Table 6.16: Performance results on evaluation set for providers using final features
6.3.1.8 Intent Filters
The initial feature set of the intent-filter category is also very large. It has been
reduced by factors of 87, 88 and 37, bringing each set below 1,000 features. This category
shows good detection results for samples after 2012 but performs poorly on earlier samples,
so we do not consider this set for the final model. Table 6.17 shows the performance
results for this category using the initial features.
Classifier   D/2010-2012 (I: 82,281)   D/2013-2015 (I: 85,948)   D/2016-2019 (I: 34,200)
             Acc  Prec  Rec  F1        Acc  Prec  Rec  F1        Acc  Prec  Rec  F1
LR 77.94 90.29 62.9 74.14 87.91 92.23 82.93 87.33 79.41 92.17 64.2 75.68
RF 78.89 92.1 63.46 75.14 89.11 93.66 84.02 88.58 81.33 91.73 68.79 78.62
NN 78.41 90.84 63.46 74.72 88.6 93.28 83.31 88.01 79.88 90.62 66.58 76.76
ET 78.9 92.19 63.42 75.15 89.0 93.49 83.95 88.46 81.43 92.12 68.66 78.68
Table 6.17: Performance results on evaluation set for Intent Filters using initial features
Table 6.18 shows the performance results for the intent-filter category using the final set
of features.
Classifier   D/2010-2012 (F: 938)   D/2013-2015 (F: 973)   D/2016-2019 (F: 916)
             Acc  Prec  Rec  F1     Acc  Prec  Rec  F1     Acc  Prec  Rec  F1
LR 76.95 89.34 61.5 72.85 86.24 93.81 77.76 85.03 78.67 91.07 63.49 74.82
RF 78.27 91.54 62.57 74.33 88.85 93.57 83.55 88.28 80.16 91.86 66.1 76.88
NN 77.64 90.21 62.31 73.71 88.12 93.31 82.25 87.44 79.48 92.78 63.85 75.64
ET 78.27 91.62 62.5 74.31 88.82 93.63 83.42 88.23 80.11 92.01 65.87 76.78
Table 6.18: Performance results on evaluation set for Intent Filters using final features
6.3.1.9 Intent Objects
The initial feature set of the intent-object category is also very large. It has been
reduced by factors of 123, 131 and 33, bringing each set below 1,000 features. This category
does not show good detection results on any of the datasets. Therefore, due to the poor
results and the large size of the final feature set, we do not consider this set for the final
model. Table 6.19 shows the detection results on the initial set of features.
Classifier   D/2010-2012 (I: 1,19,570)   D/2013-2015 (I: 1,30,259)   D/2016-2019 (I: 33,461)
             Acc  Prec  Rec  F1          Acc  Prec  Rec  F1          Acc  Prec  Rec  F1
LR 65.19 98.2 31.36 47.54 66.07 66.57 65.24 65.9 76.77 76.67 76.83 76.75
RF 65.18 98.14 31.36 47.54 66.63 66.69 67.12 66.91 78.67 79.37 77.37 78.36
NN 65.19 97.92 31.45 47.61 66.59 66.78 66.72 66.75 78.01 78.65 76.77 77.7
ET 65.17 98.15 31.33 47.5 66.63 66.68 67.13 66.9 78.75 79.48 77.42 78.43
Table 6.19: Performance results on evaluation set for Intent Objects using initial features
Table 6.20 shows the detection results on the final set of features.
Classifier   D/2010-2012 (F: 972)   D/2013-2015 (F: 994)   D/2016-2019 (F: 986)
             Acc  Prec  Rec  F1     Acc  Prec  Rec  F1     Acc  Prec  Rec  F1
LR 63.41 98.17 27.76 43.28 63.44 64.51 60.57 62.48 72.51 73.12 71.04 72.06
RF 63.42 98.26 27.76 43.29 64.18 64.82 62.83 63.81 75.52 69.8 89.79 78.54
NN 63.43 98.3 27.76 43.29 63.89 65.24 60.24 62.64 74.73 69.31 88.62 77.78
ET 63.42 98.18 27.77 43.3 64.19 64.84 62.8 63.8 75.51 69.8 89.76 78.53
Table 6.20: Performance results on evaluation set for Intent Objects using final features
6.3.1.10 Intent Const
The intent-const category shows very good detection results for all the datasets, and
the results are consistent across the evaluation metrics. The final sets of features have
been reduced by factors of 240, 191 and 72, bringing each below 100 features. Therefore, we
consider this set for training our final detection model. Table 6.21 shows the detection
results for the intent-const category using the initial set of features.
Classifier   D/2010-2012 (I: 14,409)   D/2013-2015 (I: 17,226)   D/2016-2019 (I: 4816)
             Acc  Prec  Rec  F1        Acc  Prec  Rec  F1        Acc  Prec  Rec  F1
LR 94.19 94.77 93.61 94.19 86.24 86.18 86.47 86.33 87.44 87.6 87.18 87.39
RF 96.39 96.16 96.68 96.42 91.54 90.25 93.23 91.72 93.01 93.04 92.95 93.0
NN 95.33 94.88 95.9 95.39 88.36 88.17 88.75 88.46 90.37 89.83 91.0 90.41
ET 96.36 96.23 96.55 96.39 91.56 90.24 93.31 91.75 93.21 93.29 93.08 93.19
Table 6.21: Performance results on evaluation set for Intent Const using initial features
Table 6.22 shows the detection results for intent const category using final set of features.
Classifier   D/2010-2012 (F: 60)   D/2013-2015 (F: 90)   D/2016-2019 (F: 66)
             Acc  Prec  Rec  F1    Acc  Prec  Rec  F1    Acc  Prec  Rec  F1
LR 93.65 94.35 92.94 93.64 85.31 85.0 85.94 85.47 85.17 85.81 84.2 85.0
RF 96.29 96.05 96.6 96.32 91.33 89.78 93.38 91.55 92.53 92.42 92.62 92.52
NN 94.71 93.8 95.82 94.8 87.16 86.79 87.81 87.3 87.86 87.98 87.65 87.81
ET 96.27 96.12 96.48 96.3 91.39 90.0 93.22 91.58 92.63 92.58 92.66 92.62
Table 6.22: Performance results on evaluation set for Intent Const using final features
6.3.1.11 API
The API category also shows very good detection results and consistent evaluation
metrics for all the datasets. The final sets of features have been reduced by factors of 2152,
2612 and 2270, bringing each down to at most 30 features. Therefore, we consider this set
for training our final detection model. Table 6.23 shows the detection results for the API
category using the initial set of features.
Classifier   D/2010-2012 (I: 38,747)   D/2013-2015 (I: 52,245)   D/2016-2019 (I: 68,113)
             Acc  Prec  Rec  F1        Acc  Prec  Rec  F1        Acc  Prec  Rec  F1
LR 97.82 97.35 98.34 97.84 97.47 97.05 97.95 97.5 97.5 97.2 97.81 97.51
RF 97.93 97.93 97.96 97.94 97.62 97.72 97.54 97.63 97.03 97.02 97.04 97.03
NN 97.74 97.64 97.87 97.76 97.8 97.24 98.41 97.82 97.32 97.02 97.63 97.33
ET 97.92 97.94 97.93 97.93 97.53 97.8 97.26 97.53 97.03 97.15 96.9 97.02
Table 6.23: Performance results on evaluation set for API using initial features
Table 6.24 shows the detection results for API category using final set of features.
Classifier   D/2010-2012 (F: 18)   D/2013-2015 (F: 20)   D/2016-2019 (F: 30)
             Acc  Prec  Rec  F1    Acc  Prec  Rec  F1    Acc  Prec  Rec  F1
LR 89.61 92.76 86.07 89.29 83.18 88.45 76.52 82.05 90.33 89.72 91.06 90.38
RF 97.72 97.56 97.9 97.73 97.12 97.31 96.96 97.13 96.83 96.88 96.77 96.82
NN 89.34 93.39 84.8 88.89 88.19 86.3 90.93 88.56 91.22 91.36 91.02 91.19
ET 97.66 97.57 97.78 97.68 97.21 97.42 97.02 97.22 96.95 97.04 96.84 96.94
Table 6.24: Performance results on evaluation set for API using final features
6.3.1.12 API packages
The API-package category shows very good detection results with very few features in
the final set. The performance results are also consistent across the datasets and all
evaluation metrics. Therefore we consider this category for our final model. Table 6.25
shows the performance results for this category using the initial set of features.
Classifier   D/2010-2012 (I: 179)   D/2013-2015 (I: 212)   D/2016-2019 (I: 232)
             Acc  Prec  Rec  F1     Acc  Prec  Rec  F1     Acc  Prec  Rec  F1
LR 91.01 91.78 90.21 90.99 87.67 88.83 86.33 87.56 85.71 87.42 83.36 85.34
RF 97.48 97.63 97.35 97.49 97.67 97.93 97.42 97.67 96.25 96.62 95.84 96.23
NN 90.25 87.05 94.7 90.34 90.54 90.2 90.37 90.28 49.79 49.84 98.14 66.11
ET 97.31 97.7 96.93 97.32 97.55 98.0 97.11 97.55 96.2 96.73 95.63 96.17
Table 6.25: Performance results on evaluation set for API packages using initial features
Table 6.26 shows the performance results for this category using final set of features.
Classifier   D/2010-2012 (F: 15)   D/2013-2015 (F: 20)   D/2016-2019 (F: 25)
             Acc  Prec  Rec  F1    Acc  Prec  Rec  F1    Acc  Prec  Rec  F1
LR 88.25 91.11 84.91 87.9 80.18 83.92 74.91 79.16 82.77 80.79 85.89 83.26
RF 97.31 97.31 97.35 97.33 97.27 97.57 96.98 97.27 96.26 96.52 95.96 96.24
NN 90.35 89.13 92.03 90.56 77.49 89.54 62.51 73.62 61.69 98.99 23.48 37.96
ET 97.22 97.39 97.07 97.23 97.35 97.64 97.08 97.36 96.23 96.82 95.6 96.2
Table 6.26: Performance results on evaluation set for API packages using final features
6.3.1.13 System Commands
The system-commands category also shows very good detection results in all the datasets
and for all the evaluation metrics, with very few features. Therefore we consider this
set for our final model. Table 6.27 shows the performance results for this category using
the initial set of features.
Classifier   D/2010-2012 (I: 181)   D/2013-2015 (I: 193)   D/2016-2019 (I: 175)
             Acc  Prec  Rec  F1     Acc  Prec  Rec  F1     Acc  Prec  Rec  F1
LR 82.04 85.46 77.48 81.27 80.26 86.74 71.68 78.49 81.72 82.73 80.1 81.39
RF 92.48 92.82 92.18 92.5 91.83 95.14 88.24 91.56 92.03 91.68 92.42 92.05
NN 86.5 89.19 83.24 86.11 83.09 85.65 79.69 82.57 85.35 85.45 85.15 85.3
ET 92.47 93.01 91.93 92.47 91.73 95.09 88.1 91.46 92.03 91.84 92.21 92.03
Table 6.27: Performance results on evaluation set for system commands using initial features
Table 6.28 shows the performance results for this category using final set of features.
Classifier   D/2010-2012 (F: 42)   D/2013-2015 (F: 42)   D/2016-2019 (F: 42)
             Acc  Prec  Rec  F1    Acc  Prec  Rec  F1    Acc  Prec  Rec  F1
LR 81.36 84.84 76.62 80.52 78.75 84.74 70.38 76.9 80.31 81.12 78.9 80.0
RF 92.25 92.85 91.65 92.25 91.51 94.86 87.87 91.23 91.93 91.64 92.24 91.94
NN 86.73 88.04 85.19 86.59 82.94 87.43 77.15 81.97 84.3 85.91 81.99 83.9
ET 92.19 92.99 91.36 92.17 91.45 94.88 87.71 91.15 91.82 91.56 92.11 91.83
Table 6.28: Performance results on evaluation set for system commands using final features
6.3.1.14 Opcodes
The final feature set sizes for the opcode category have been reduced below 50. This
category shows very good detection results with very few features, and the results are
consistent. Therefore we consider this set for our final model. Table 6.29 shows the
performance results on the evaluation set for opcodes using the initial features.
Classifier   D/2010-2012 (I: 222)   D/2013-2015 (I: 222)   D/2016-2019 (I: 224)
             Acc  Prec  Rec  F1     Acc  Prec  Rec  F1     Acc  Prec  Rec  F1
LR 88.26 85.98 91.59 88.7 85.66 84.97 86.82 85.88 85.91 89.05 81.82 85.28
RF 96.95 97.46 96.45 96.95 96.74 97.15 96.35 96.75 95.95 96.44 95.4 95.92
NN 91.84 90.0 94.25 92.07 87.14 82.91 93.72 87.99 50.01 49.96 99.94 66.62
ET 96.79 97.46 96.12 96.79 96.58 97.12 96.05 96.58 95.8 96.59 94.94 95.76
Table 6.29: Performance results on evaluation set for opcodes using initial features
Table 6.30 shows performance results on evaluation set for opcodes using final features.
Classifier   D/2010-2012 (F: 30)   D/2013-2015 (F: 42)   D/2016-2019 (F: 30)
             Acc  Prec  Rec  F1    Acc  Prec  Rec  F1    Acc  Prec  Rec  F1
LR 86.07 85.95 86.42 86.19 81.18 79.5 84.27 81.82 81.19 85.63 74.87 79.89
RF 96.42 97.09 95.76 96.42 96.26 96.56 95.99 96.27 95.16 95.72 94.53 95.13
NN 85.44 83.21 89.01 86.01 83.28 82.22 85.14 83.66 61.72 98.5 23.66 38.16
ET 96.4 97.18 95.6 96.39 96.29 96.68 95.9 96.29 95.28 96.07 94.4 95.23
Table 6.30: Performance results on evaluation set for opcodes using final features
6.3.1.15 Misc Features
The misc-features category also shows very good detection results with very few
features. Therefore we consider this set for our final model. Table 6.31 shows the detection
results for this category using the initial set of features.
Classifier   D/2010-2012 (I: 42)   D/2013-2015 (I: 42)   D/2016-2019 (I: 42)
             Acc  Prec  Rec  F1    Acc  Prec  Rec  F1    Acc  Prec  Rec  F1
LR 88.79 90.42 86.91 88.63 88.32 89.79 86.61 88.17 86.56 87.02 85.86 86.44
RF 97.1 96.94 97.31 97.12 97.45 97.28 97.66 97.47 96.62 97.01 96.2 96.6
NN 90.89 90.34 91.7 91.01 90.63 88.84 93.04 90.89 83.39 87.42 77.93 82.4
ET 97.07 97.07 97.12 97.09 97.41 97.41 97.44 97.43 96.55 97.13 95.93 96.53
Table 6.31: Performance results on evaluation set for Misc Features using initial features
Table 6.32 shows the detection results for this category using final set of features.
Classifier   D/2010-2012 (F: 24)   D/2013-2015 (F: 26)   D/2016-2019 (F: 30)
             Acc  Prec  Rec  F1    Acc  Prec  Rec  F1    Acc  Prec  Rec  F1
LR 88.02 89.19 86.7 87.92 87.94 89.39 86.25 87.79 86.16 86.55 85.56 86.05
RF 97.04 96.84 97.28 97.06 97.46 97.24 97.73 97.48 96.5 96.87 96.08 96.47
NN 90.34 89.69 91.3 90.48 88.23 89.85 86.34 88.06 50.2 50.05 99.87 66.68
ET 96.95 96.98 96.95 96.97 97.33 97.23 97.46 97.34 96.67 97.26 96.03 96.64
Table 6.32: Performance results on evaluation set for Misc Features using final features
6.3.2 Final Results
Based on the category-wise analysis of the results, we identify 7 categories as relevant.
From here on, we use the D/2010-2019 dataset and the Extra Trees classifier for further
analysis and for generating the final model.
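The Extra Trees classifier referred to here is scikit-learn's ExtraTreesClassifier. A minimal sketch of training it and ranking feature importances, which is how "important features" can be identified from a fitted ensemble (the toy data and hyperparameters below are illustrative assumptions, not the thesis's configuration):

```python
from sklearn.ensemble import ExtraTreesClassifier

# Toy data: only the first column actually determines the label,
# so it should dominate the learned importances.
X = [[1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 0]] * 25
y = [1, 1, 0, 0] * 25

clf = ExtraTreesClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

# Rank features by the impurity-based importance scores.
ranked = sorted(enumerate(clf.feature_importances_),
                key=lambda t: t[1], reverse=True)
top_feature = ranked[0][0]
print(top_feature, clf.score(X, y))
```

Keeping only the top-ranked columns is one way to arrive at a small final feature set like those reported in the tables above.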
6.3.2.1 Base Model Results
For the D/2010-2019 dataset, the initial set contains 40,00,724 features when all the
categories are considered. The detection results on the evaluation set using the Extra
Trees classifier are:
• Acc - 97.63
• Prec - 98.04
• Rec - 97.21
• F1 - 97.62
When considering only the 7 relevant categories, the initial set contains 1,76,656 features.
The detection results on the evaluation set are:
• Acc - 98.0
• Prec - 98.21
• Rec - 97.79
• F1 - 98.0
6.3.2.2 Individual Category Results
For the D/2010-2012 dataset, we identify the important features in each of the 7 categories.
Table 6.33 shows the performance results for these categories using the initial (I) and
final (F) sets of features, where I and F give the number of features in each set.
Category            I      Acc   Prec  Rec   F1       F    Acc   Prec  Rec   F1
API Packages 238 97.32 97.6 97.03 97.31 24 97.06 97.3 96.82 97.06
Opcodes 225 96.61 97.25 95.94 96.59 45 96.34 96.87 95.78 96.32
System Commands 209 91.25 94.78 87.32 90.9 45 91.05 94.54 87.16 90.7
API 78199 97.89 98.02 97.75 97.89 35 97.44 97.66 97.21 97.43
Requested Perm 66219 93.15 93.02 93.33 93.17 60 92.89 92.98 92.8 92.89
Intent Const 31567 90.23 88.7 92.23 90.43 60 89.61 94.5 84.14 89.02
Misc Feat 42 97.1 96.99 97.23 97.11 30 97.1 96.98 97.23 97.11
Table 6.33: Performance results of relevant categories on evaluation set
6.3.2.3 Combined Category Results
After identifying, in each relevant category, the important features that give robust,
generalised and efficient detection, we explore combining the detection abilities of the
categories. Table 6.34 shows the performance results for several combinations of the 7
identified categories (RP: requested permissions, IC: intent const, AP: API packages,
SC: system commands, OP: opcodes, M: misc features).
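The combined feature-set sizes in Table 6.34 are roughly the sums of the per-category final sizes in Table 6.33, which is consistent with combining categories by concatenating their per-app feature vectors. A hedged sketch of that operation (the function name and toy data are assumptions for illustration):

```python
def combine_categories(*category_matrices):
    """Horizontally concatenate per-category feature rows.

    Each argument is a list of per-app feature vectors (same app order
    in every category); the result is one wider vector per app.
    """
    combined = []
    for app_rows in zip(*category_matrices):
        row = []
        for part in app_rows:
            row.extend(part)
        combined.append(row)
    return combined

# e.g. 2 apps, with 3 requested-permission features and 2 intent-const features
rp = [[1, 0, 1], [0, 0, 1]]
ic = [[0, 1], [1, 1]]
combined = combine_categories(rp, ic)
print(combined)  # -> [[1, 0, 1, 0, 1], [0, 0, 1, 1, 1]]
```

A single classifier can then be trained on the widened matrix, which is what the combination rows of Table 6.34 evaluate.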
6.3.2.4 Results on Test set
Finally, we built an Android malware detection model using samples from D/2010-2019,
based on the identified important features. Its performance on different test sets is
shown in Table 6.35. The results are consistent across the evaluation metrics and all
the test sets, suggesting that the model is robust, sustainable and capable of effective
detection of Android malware.
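The metric columns of Table 6.35 follow directly from the TN/TP/FN/FP counts in the same row. As a check, the standard definitions applied to the D/2010-2012 row (small discrepancies with the printed table are presumably due to rounding in the thesis):

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 (in percent) from a
    binary confusion matrix, with malware as the positive class."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return tuple(round(100 * v, 2) for v in (acc, prec, rec, f1))

# Counts for the D/2010-2012 test set, taken from Table 6.35.
acc, prec, rec, f1 = metrics_from_counts(tp=38105, tn=37841, fp=883, fn=617)
print(acc, prec, rec, f1)  # -> 98.06 97.74 98.41 98.07
```

The table reports 98.06 / 97.77 / 98.40 / 98.06 for this row, in close agreement.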
6.3.2.5 Effectiveness of Identified Features
The identified features are those that are the most informative and important for malware
detection across sample sets. To demonstrate their effectiveness for Android malware
detection, we train two models on the training sets of the D/DREBIN and D/AMD datasets
and evaluate them on their respective test sets. Table 6.36 shows the performance results
of these two models.
Summary: We have analysed several categories of features. Our analysis shows that some
categories require many features to give good detection results while others require very
few. Some categories detect well only for samples within a certain time frame, and some
perform well only on certain evaluation metrics. Therefore, to build a light-weight model
that will sustain over time, we have analysed and identified only the relevant features
and built a detection model based on them. The performance of this model on various test
sets suggests that it is robust and will sustain over time. Further, to show the
effectiveness of the identified features, we have experimented on two more datasets; the
performance of the corresponding models shows that this set of features yields an
effective malware detection model.
Categories F Acc Prec Rec F1
RP + IC + API 153 97.98 97.87 98.1 97.99
RP + IC + AP 142 97.97 97.94 98.01 97.98
IC + API + M 123 97.91 97.97 97.86 97.92
RP + IC + API + AP 177 98.05 97.96 98.16 98.06
RP + IC + API + SC 198 98.04 97.92 98.17 98.04
RP + IC + API + M 183 97.99 97.88 98.11 98.0
RP + IC + AP + SC 187 97.99 97.92 98.06 97.99
RP + IC + AP + M 172 97.99 97.95 98.04 98.0
RP + IC + SC + OP 208 97.87 97.87 97.89 97.88
RP + IC + SC + M 193 97.89 97.77 98.02 97.9
RP + API + AP + SC 162 97.89 97.77 98.03 97.9
RP + API + AP + M 147 97.89 97.81 97.99 97.9
RP + API + SC + M 168 97.88 97.75 98.02 97.88
IC + API + AP + SC 162 97.86 98.01 97.7 97.85
IC + API + AP + M 147 97.99 98.11 97.87 97.99
IC + API + SC + M 168 97.91 97.95 97.88 97.91
IC + API + OP + M 168 97.88 98.01 97.75 97.88
IC + AP + SC + M 157 97.86 98.04 97.69 97.86
RP + IC + API + AP + SC 222 98.1 98.02 98.19 98.1
RP + IC + API + AP + OP 222 98.07 98.06 98.09 98.07
RP + IC + API + AP + M 207 98.12 98.06 98.2 98.13
RP + IC + API + SC + OP 243 98.02 97.93 98.12 98.03
RP + IC + API + SC + M 228 98.03 97.91 98.16 98.03
RP + IC + API + OP + M 228 98.01 97.97 98.05 98.01
RP + IC + AP + SC + OP 232 97.98 97.96 98.0 97.98
RP + IC + AP + SC + M 217 98.02 98.0 98.05 98.02
RP + IC + AP + OP + M 217 97.99 98.02 97.96 97.99
RP + IC + SC + OP + M 238 97.95 97.9 98.0 97.95
RP + API + AP + SC + OP 207 97.9 97.83 97.98 97.91
RP + API + AP + SC + M 192 97.93 97.86 98.0 97.93
IC + API + AP + SC + M 192 98.02 98.12 97.92 98.02
IC + API + AP + OP + M 192 97.92 98.1 97.74 97.92
RP + IC + API + AP + SC + OP 267 98.08 98.07 98.11 98.09
RP + IC + API + AP + SC + M 252 98.1 98.03 98.18 98.1
RP + IC + API + SC + OP + M 273 98.02 97.95 98.1 98.03
IC + API + AP + SC + OP + M 237 97.95 98.07 97.84 97.95
RP + IC + API + AP + SC + OP + M 297 98.11 98.08 98.14 98.11
Table 6.34: Performance results using combined categories
Test Set Acc Prec Rec F1 TN TP FN FP
D/2010-2012 98.06 97.77 98.40 98.06 37,841 38,105 617 883
D/2013-2015 98.72 98.81 98.62 98.72 36,585 36,526 510 437
D/2016-2019 97.65 97.52 97.78 97.65 6525 6540 148 166
Table 6.35: Performance results of final model on different test sets
Test Set Acc Prec Rec F1 TN TP FN FP
D/AMD 99.61 99.61 99.60 99.60 6,161 5,980 24 23
D/DREBIN 98.08 98.88 97.22 98.04 1,379 1,330 38 15
Table 6.36: Performance results to show effectiveness of identified features
Chapter 7
Conclusion and Future Work
7.1 Conclusion
In this thesis, we have built a light-weight malware detection model capable of fast,
generalized, accurate and efficient detection of Android malware. To fulfil this objective,
we designed a framework that deals with the challenges faced in Android malware detection.
We have worked with more than 8 lakh (800,000) samples spread over the period 2010 to 2019.
We created multiple datasets to perform parallel analyses for greater robustness. Unlike
previous datasets, the datasets we work on are highly class-balanced, so machine learning
models are expected to train better on them. We extracted several categories of information,
such as permissions, APIs, intents and app components, and analysed their effectiveness
towards Android malware detection. We used multiple evaluation metrics for a proper measure
of performance. Then, with the aim of creating a light-weight detection model, we
implemented three feature selection techniques and identified only the relevant features:
those that are the most informative and important for malware detection across samples
from different years. Finally, using these identified sets of features, we built a model
capable of temporally robust detection of Android malware.
7.2 Future Work
In this work, we have analysed only features extracted using static analysis. Additional
features could be extracted using dynamic analysis to create a more informative feature set.