Feature Engineering & Analysis Towards Temporally Robust Detection of Android Malware
A thesis submitted in partial fulfilment of the requirements
for the degree of Master of Technology
by
Sagar Jaiswal
17111037
under the guidance of
Prof. Sandeep K. Shukla
to the
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KANPUR
July 2019
Abstract
Name of the student: Sagar Jaiswal Roll No: 17111037
Degree for which submitted: M.Tech.
Department: Computer Science & Engineering
Thesis title: Feature Engineering & Analysis Towards Temporally Robust
Detection of Android Malware
Thesis supervisor: Prof. Sandeep K. Shukla
Month and year of thesis submission: July 2019
With the increase in popularity of Android, the number of active users and the day-to-day activity of each user on Android devices have also grown considerably. This has led to malware authors targeting Android devices more and more. It has been reported that 8,400 new instances of malware are found every day, implying that a new malware sample surfaces every 10 seconds! With the growth in the number, variants, diversity and sophistication of malware, conventional methods often fail to detect malicious applications. Moreover, Android allows downloading and installation of applications from unverified sources, so it has become very easy for malware authors to bundle and distribute applications with malware. Fast and accurate detection of Android malware has therefore become a real challenge. These issues call for a more reliable, accurate, generalized and efficient model for Android malware detection.
In this thesis, we build a lightweight malware detection model that is capable of fast, generalized, accurate and efficient detection of Android malware. To fulfill this objective, we have designed a framework that deals with the challenges faced in Android malware detection. We work with more than eight lakh (800,000) samples spread over the period 2010 to 2019. We have created multiple datasets to perform parallel analyses for more robustness. Unlike previous datasets, the datasets we work on are highly class-balanced, and therefore machine learning models are expected to train better on them. We extract several categories of information, such as permissions, APIs, Intents and app components, and analyse their effectiveness towards Android malware detection. Then, with the aim of creating a lightweight detection model, we propose and demonstrate the effectiveness of three feature selection techniques to analyze and identify only the relevant features - those that are the most informative and important for malware detection across samples from different years. Finally, we present the relevant sets of features and a model capable of temporally robust detection of Android malware.
Acknowledgements
I would like to express my sincere gratitude to my thesis advisor Prof. Sandeep K. Shukla
for his constant support and guidance. His inputs throughout the course of the thesis work
were extremely beneficial and constructive.
I would like to acknowledge my fellow graduate students for their feedback. I am also
grateful to Mr Saurabh Kumar for his input and cooperation.
I would also like to thank my family. Without their support and encouragement, this
accomplishment would not have been possible. I would also like to thank my friends for
always being there whenever I needed them.
Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Abbreviations

1 Introduction
  1.1 Problem
  1.2 Motivation
  1.3 Challenges
  1.4 Contribution
  1.5 Thesis Outline

2 Background
  2.1 Android System Architecture
    2.1.1 Linux Kernel
    2.1.2 Hardware Abstraction Layer
    2.1.3 Android Runtime
    2.1.4 Native C/C++ Libraries
    2.1.5 Java API Framework
    2.1.6 Applications
  2.2 Android Applications
    2.2.1 Application Fundamentals
      2.2.1.1 Application Manifest File
      2.2.1.2 Dex File
      2.2.1.3 Permissions
      2.2.1.4 Application Components
      2.2.1.5 Intents
      2.2.1.6 Opcodes
      2.2.1.7 System Commands
  2.3 Android Malware
    2.3.1 Evolution
    2.3.2 Types of Malware
    2.3.3 Malware Detection Approaches
      2.3.3.1 Invasion Detection
      2.3.3.2 Misuse Detection
      2.3.3.3 Anomaly Detection
  2.4 Tools
    2.4.1 Androguard
    2.4.2 Spyder
  2.5 Machine Learning Classifiers
  2.6 Related Work

3 Dataset
  3.1 D/2010-2012
  3.2 D/2013-2015
  3.3 D/2016-2019
  3.4 D/2010-2019
  3.5 D/DREBIN
  3.6 D/AMD

4 Framework Design
  4.1 Framework
  4.2 Framework Modules
    4.2.1 Information Extraction Module
    4.2.2 Feature Engineering & Analysis Module
      4.2.2.1 Feature Construction/Representation
      4.2.2.2 Feature Selection
      4.2.2.3 Analysis
    4.2.3 Learning Detection Model Module

5 Methodology
  5.1 Features
    5.1.1 Feature Construction
      5.1.1.1 Permissions
      5.1.1.2 API
      5.1.1.3 Application Components
      5.1.1.4 Intents
      5.1.1.5 System Commands
      5.1.1.6 Opcodes
      5.1.1.7 Misc Features
    5.1.2 Feature Representation
    5.1.3 Types of Feature Sets
    5.1.4 Feature Size
  5.2 Data Cleaning
  5.3 Frequency of Usage
  5.4 RFECV
  5.5 RFE
  5.6 Learning Final Model
  5.7 Effectiveness of Identified Features

6 Experimentation and Results
  6.1 Experimental Setup
    6.1.1 System
    6.1.2 Machine Learning Classifiers
    6.1.3 Dataset Partition
  6.2 Evaluation Metrics
    6.2.1 Confusion Matrix
    6.2.2 Accuracy
    6.2.3 Precision
    6.2.4 Recall
    6.2.5 F1 Score
  6.3 Results
    6.3.1 Analysis Results
      6.3.1.1 Requested Permissions
      6.3.1.2 Used Permissions
      6.3.1.3 Restricted API
      6.3.1.4 Service
      6.3.1.5 Receiver
      6.3.1.6 Activity
      6.3.1.7 Providers
      6.3.1.8 Intent Filters
      6.3.1.9 Intent Objects
      6.3.1.10 Intent Const
      6.3.1.11 API
      6.3.1.12 API Packages
      6.3.1.13 System Commands
      6.3.1.14 Opcodes
      6.3.1.15 Misc Features
    6.3.2 Final Results
      6.3.2.1 Base Model Results
      6.3.2.2 Individual Category Results
      6.3.2.3 Combined Category Results
      6.3.2.4 Results on Test Set
      6.3.2.5 Effectiveness of Identified Features

7 Conclusion and Future Work
  7.1 Conclusion
  7.2 Future Work

Bibliography
List of Figures

2.1 Android System Architecture [2]
4.1 The Proposed Framework
4.2 Information Extraction Module
4.3 Feature Engineering & Analysis Module
4.4 Feature Selection Process
5.1 Drop in Number of Features in Feature Sets
5.2 Accuracy vs Number of Features plot for API Packages based on frequency of usage
5.3 Accuracy vs Number of Features plot for System Commands based on frequency of usage
5.4 Accuracy vs Number of Features plot for Requested Permissions based on frequency of usage
5.5 Accuracy vs Number of Features plot for Requested Permissions using RFECV
5.6 Accuracy vs Number of Features plot for API Packages using RFECV
5.7 Accuracy vs Number of Features plot for Intent Const using RFECV
5.8 Accuracy vs Number of Features plot for Opcodes using RFECV
List of Tables

2.1 Datasets for previous works
3.1 Datasets Summary
5.1 List of Feature Sets
5.2 Number of Features in each Feature Set
6.1 Partition of Datasets
6.2 Confusion Matrix
6.3 Performance results on evaluation set for requested permissions using initial features
6.4 Performance results on evaluation set for requested permissions using final features
6.5 Performance results on evaluation set for used permissions using initial features
6.6 Performance results on evaluation set for used permissions using final features
6.7 Performance results on evaluation set for restricted API using initial features
6.8 Performance results on evaluation set for restricted API using final features
6.9 Performance results on evaluation set for service using initial features
6.10 Performance results on evaluation set for service using final features
6.11 Performance results on evaluation set for receiver using initial features
6.12 Performance results on evaluation set for receiver using final features
6.13 Performance results on evaluation set for activity using initial features
6.14 Performance results on evaluation set for activity using final features
6.15 Performance results on evaluation set for providers using initial features
6.16 Performance results on evaluation set for providers using final features
6.17 Performance results on evaluation set for Intent Filters using initial features
6.18 Performance results on evaluation set for Intent Filters using final features
6.19 Performance results on evaluation set for Intent Objects using initial features
6.20 Performance results on evaluation set for Intent Objects using final features
6.21 Performance results on evaluation set for Intent Const using initial features
6.22 Performance results on evaluation set for Intent Const using final features
6.23 Performance results on evaluation set for API using initial features
6.24 Performance results on evaluation set for API using final features
6.25 Performance results on evaluation set for API packages using initial features
6.26 Performance results on evaluation set for API packages using final features
6.27 Performance results on evaluation set for system commands using initial features
6.28 Performance results on evaluation set for system commands using final features
6.29 Performance results on evaluation set for opcodes using initial features
6.30 Performance results on evaluation set for opcodes using final features
6.31 Performance results on evaluation set for Misc Features using initial features
6.32 Performance results on evaluation set for Misc Features using final features
6.33 Performance results of relevant categories on evaluation set
6.34 Performance results using combined categories
6.35 Performance results of final model on different test sets
6.36 Performance results to show effectiveness of identified features
Abbreviations

OS Operating System
AV Anti-Virus
DVM Dalvik Virtual Machine
JVM Java Virtual Machine
VM Virtual Machine
ART Android Runtime
ELF Executable and Linkable Format
NDK Native Development Kit
SDK Software Development Kit
APK Android Package
LR Logistic Regression
NN Neural Network
RF Random Forest
ET Extra Tree
Acc Accuracy
Prec Precision
Rec Recall
F1 F1 Score
RP Requested Permission
IC Intent Const
AP API Packages
SC System Commands
O Opcodes
M Miscellaneous
Dedicated to my Parents
Chapter 1
Introduction
In recent years, Android has overtaken many other mobile operating systems to become
one of the most popular mobile platforms in the world. A report recently shared by IDC
on smartphone operating system global market share [13] shows that in the 3rd quarter of
2018, the total market share of Android was 86.8%. Developers prefer Android over
other smartphone operating systems for developing applications because it is completely
open source. Users, on the other hand, are inclined to opt for Android smartphones due
to the availability of low- to high-end models, ease of use, customization, a high level of
multitasking, custom ROMs, support for a large number of applications, etc. In May 2019,
Google revealed that there are now more than two and a half billion monthly active
Android devices [21].
1.1 Problem
With the increase in popularity of Android, the number of active users and the day-to-day
activity of each user on Android devices have also grown considerably. This has led to
malware authors targeting Android devices more and more. It has been reported by
Gadgets360 [33] that 8,400 new instances of malware are found every day. This implies
that a new malware sample surfaces every 10 seconds!
1.2 Motivation
Google highly recommends using trusted and verified sources like the Google Play Store
for the installation of Android applications. The Google Play Store uses Google Play
Protect to provide security features that protect users from malicious applications [2]. It
scans all Android applications before and after installation to ensure no foul play is
happening. However, due to the increasing number and variants of malware, malicious
applications keep getting into the Google Play Store. According to a news report by
TechCrunch [36], a new kind of mobile adware got past the Google Play Store's security
mechanisms and hid in hundreds of Android applications; these infected applications were
downloaded more than 150 million times. Third-party markets can also be used for
downloading and installing Android applications, and Android allows installation from
unverified sources. It has therefore become very easy for malware authors to bundle and
distribute applications with malware.
1.3 Challenges
With the growth in the number, variants, diversity and sophistication of malware, conventional
methods often fail to detect malicious applications. Malware authors often deploy
techniques such as code obfuscation to avoid detection [5]. Such techniques are implemented
to protect malware from reverse engineering. This leads to the transformation of
strings/features into different forms that are less informative and contribute little to
machine-learning-based malware detection systems. As a result of such transformations,
the feature set size also grows, which increases the complexity for machine learning
techniques of training a classifier for efficient malware detection. Fast and accurate
detection of Android malware has therefore become a real challenge. These issues call for
a more reliable, accurate, generalized and efficient model for Android malware detection.
1.4 Contribution
The aim and objective of this work is to face the challenges in the area and build a
lightweight malware detection model that is capable of fast, generalized, accurate and
efficient detection of Android malware. To fulfill this objective, we have designed a
framework that deals with the challenges faced in Android malware detection. The major
contributions of this work towards Android malware analysis and detection are -
• We analyze the effectiveness of different features extracted using static analysis for
detecting Android malware using multiple datasets.
• We present several feature selection techniques that filter the noise present in the data
and output only relevant and informative features.
• We evaluate and demonstrate the effectiveness of our model for Android malware
detection.
1.5 Thesis Outline
Chapter 2 presents an outline of Android by discussing the Android platform architecture,
Android applications and their fundamentals. It then provides an introduction to malware,
its evolution over time, types of malware and malware detection approaches. It also briefly
explains the tools used during this study and ends with a discussion of related work
in this field. Chapter 3 discusses the sources of Android applications - both benign
and malicious. It describes the steps involved in the creation of each dataset along with its
size and properties. Chapter 4 provides an insight into the framework and its different modules.
Chapter 5 discusses the features constructed, their size, properties, trends and patterns. It
also describes the implementation details of the proposed framework. Chapter 6 presents
the experimentation and the results of the analysis. It covers the setup environment
of the system, the dataset split into train, evaluation & test sets, feature engineering and
model performance. Finally, Chapter 7 concludes this study and provides directions for
future work.
Chapter 2
Background
Android was developed by Google. It is an open source mobile OS based on the Linux
kernel and has been released under the Apache v2 open source license. The following
sections provide an overview of the Android system architecture, benign & malicious
applications, the Android tools and techniques used to distinguish them, and previous
research work done in this field.
2.1 Android System Architecture
The Android architecture [2] is a stack of six major components, as shown in Fig. 2.1.
Each component in the stack, along with its corresponding elements, is integrated in a
manner that provides an optimal environment for the development and execution of
applications on Android devices. The following subsections discuss each of these layers
in detail.
2.1.1 Linux Kernel
There are many advantages to using the open source Linux kernel. It serves as the
foundation of the Android platform [19]. The underlying functionalities of the Linux
kernel are -
• Device drivers
• Power management
Figure 2.1: Android System Architecture [2]
• Memory management
• Device management
• Process management
• Security management
• Network stack
2.1.2 Hardware Abstraction Layer
This layer provides a level of abstraction by allowing the device hardware and the software
to communicate. It defines a standard interface that enables the higher-level Java API
frameworks to use hardware capabilities. Several modules are present in the hardware
abstraction layer; they are loaded when a call to access a hardware component is made.
Hardware abstraction layer implementations can be changed without affecting higher
levels [12].
2.1.3 Android Runtime
DVM was the default runtime environment before Android version 5.0 (API level 21).
DVM is similar to the JVM and uses just-in-time compilation: every time the application
is launched, the dex code is translated into machine-dependent code within a virtual
machine.

ART is the runtime environment for Android version 5.0 (API level 21) or higher. It works
on the principle of ahead-of-time compilation. In the ART environment, an APK is
compiled into native code, known as an ELF executable, at the time of installation. The
ELF executable is run every time the application is launched, resulting in better
application performance.
2.1.4 Native C/C++ Libraries
The functionality of native libraries is accessed through the Java framework APIs. The
Android NDK is used to access native platform libraries when developing an application
that requires C and C++ code. With the help of native libraries (written in C/C++),
native code for Android system components and services is built.
2.1.5 Java API Framework
This layer lies on top of the core libraries & Android runtime and provides interfaces for
the development of Android applications. It also provides high-level services to Android
applications using Java classes. Applications interact directly with this framework, which
includes APIs for telephony, location, resources, package management, etc. These APIs
(written in Java) are used to access the feature set provided by the Android OS. Using
the APIs simplifies the process of developing an Android application through reuse of
components and services like -
• Telephony Manager
• Resource Manager
• Activity Manager
• Notifications Manager
• View System
• Content Providers
• Package Manager
• Location Manager
2.1.6 Applications
Applications lie on the topmost layer of the system architecture. There are two types of
applications -

• Default applications - the set of core applications, such as those for SMS or MMS
messaging, contacts, email and internet browsing, that are provided by the device
vendor itself.

• Third-party applications - the set of applications that are developed by a third party.
2.2 Android Applications
These applications are developed to run on the Android operating system. They are
available on Android application markets like the Google Play Store, Amazon Store, etc.
The code for an Android application (written in Java, Kotlin or C++), along with other
artifacts such as data and resource files, is compiled into an APK by the SDK tools
provided by Android. An APK is a type of archive file (like zip, jar or tar) with .apk as
the extension [39] and is based on the JAR file format. Android devices use this format
to install Android applications. An APK contains -

• META-INF directory: contains files like MANIFEST.MF, CERT.RSA, etc.

• lib: contains the platform-dependent compiled code for processors like MIPS, x86,
x86_64, ARM, etc.

• resources.arsc: precompiled resources like binary XML are included in this file;
resources that are not compiled here are placed in the res directory.

• assets: contains the application's assets.

• AndroidManifest.xml: an additional Android manifest file containing information
like version, name, etc.

• classes.dex: the compiled classes in the dex file format. Both the Dalvik virtual
machine and the Android Runtime can understand the dex format.
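Because an APK is just a zip archive, its top-level contents can be inspected with a standard zip reader and no Android tooling. A minimal Python sketch (the helper names are illustrative, not part of the thesis framework):

```python
import zipfile

def list_apk_contents(apk):
    """Return the entry names inside an APK (APKs are zip archives).
    Accepts a file path or a file-like object."""
    with zipfile.ZipFile(apk) as archive:
        return archive.namelist()

def has_dex(apk):
    """Check whether the archive bundles compiled dex code."""
    return any(name.endswith(".dex") for name in list_apk_contents(apk))
```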
2.2.1 Application Fundamentals
This section provides a brief synopsis of Android applications.
2.2.1.1 Application Manifest File
Every application must have an application manifest file - AndroidManifest.xml. This
manifest file includes important information about the application that is needed by the
Android OS, such as the application's package name, version name and version code, the
components of the application, permissions, etc.
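Inside the APK the manifest is stored in Android's binary XML encoding, so it must first be decoded (tools such as Androguard can do this) before it reads as plain text. Assuming an already-decoded manifest, the package name and requested permissions can be pulled out with the Python standard library; the manifest content used below is a hypothetical example:

```python
import xml.etree.ElementTree as ET

# Android's XML namespace, used for the android:name attribute
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def parse_manifest(xml_text):
    """Extract the package name and requested permissions from a
    decoded (plain-text) AndroidManifest.xml string."""
    root = ET.fromstring(xml_text)
    package = root.get("package")
    permissions = [
        elem.get(ANDROID_NS + "name")
        for elem in root.findall("uses-permission")
    ]
    return package, permissions
```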
2.2.1.2 Dex File
In an Android project, all the Java source files are compiled into .class files, which
contain Java bytecode instructions. These are in turn translated into the .dex format
(dex bytecode). Both Android runtime environments (the older DVM and the newer ART)
can understand dex bytecode. The classes and methods in an application are referenced
by the classes.dex file.
2.2.1.3 Permissions
The Android security architecture does not, by default, allow any application to perform
operations such as accessing the user's private data, keeping the device awake, accessing
another application's data, or any other operation that would harm the user, the operating
system or other apps. So, to access sensitive user data or system features, Android apps
must request permission. All such permissions must be declared in the manifest file. There
are four protection levels defined for Android permissions -

• Normal permissions - permissions that can cause very little impact or harm to the
user's privacy when granted. Normal permissions are therefore automatically granted
upon request.

• Dangerous permissions - permissions to sensitive data and resources. The user is
prompted to grant them at runtime; only dangerous permissions require agreement
from the user.

• Signature permissions - granted at install time. The application that attempts to
use the permission and the application that defines the permission must be signed
with the same certificate.

• SignatureOrSystem permissions - these work like signature-level permissions; the
only difference is the certificate used to sign the application. SignatureOrSystem
permissions are tied to the device vendor's certificate and are only granted to
applications signed with it.
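Requested permissions are one of the feature categories analysed later in this thesis. As a simple illustration (one plausible encoding, not necessarily the exact pipeline used in later chapters), an application's permissions can be represented as a binary vector over a fixed permission vocabulary:

```python
def permission_vector(requested, vocabulary):
    """One-hot encode an app's requested permissions over a fixed
    vocabulary: 1 if the permission is requested, else 0."""
    requested = set(requested)
    return [1 if perm in requested else 0 for perm in vocabulary]
```

Each app then becomes a fixed-length row suitable for a machine learning classifier.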
2.2.1.4 Application Components
For an Android application, application components serve as the essential building blocks.
Each component defines an entry point through which the user or the system can enter
the application. There exist four types of application components -

• Activity - defines an entry point for the user. It provides an interface with a single
screen for interacting with the user, e.g. in an email app there can be several
activities: one to show the list of new emails, another to read them, etc.

• Services - a general-purpose entry point to the application, used for running
background operations. A service can also provide functionality to other applications.
It does not provide any user interface, e.g. music being played by a service in the
background while the user works in a different application.

• Broadcast receivers - another entry point into the application, enabling it to respond
to system-wide broadcast announcements even when the application is not running,
e.g. an alarm for an upcoming event or a low-battery announcement.

• Content providers - manage application data. With the help of a content provider,
one application can share its data with another application securely; for example,
an application can read contact details via the contacts content provider.
2.2.1.5 Intents
An intent describes an operation or a message that facilitates inter-process and intra-process
communication. Intents are used to activate activities, services, and broadcast receivers;
at runtime, they bind individual components to each other.
2.2.1.6 Opcodes
Opcodes refer to Dalvik instructions in their raw form, similar to assembly instructions.
These instructions are mainly designed to be interpreted by the Android runtime
environment (DVM/ART).
2.2.1.7 System Commands
System commands refer to shell commands (e.g. su, uname, etc.) executed inside an
application. To execute a command, Android calls the exec() method of the Runtime
class, which creates a subprocess.
2.3 Android Malware
With the growing popularity of Android, attackers and malware authors have increasingly
focused on targeting the platform. Malware authors are taking malware creation to new
levels: malware is now more sophisticated and impactful, and is commonly disguised as
benign, useful applications.
2.3.1 Evolution
This section shows the evolution of malware in mobile platforms [37].
• Timofonica - the first known mobile virus. Its discovery dates back to June 2000.
• In June 2004, an anti-piracy Trojan hack was made by Ojam. Later in July,
a proof-of-concept virus named ‘Cabir’ targeting the Symbian OS was released.
The ‘Cabir’ virus spread via Bluetooth [40].
• In March 2005, Commwarrior-A, a worm replicating through MMS and targeting
the Symbian OS, was reported. Commwarrior was built on the basic concepts of
Cabir and was the first malware to exploit its victims financially.
• In 2006, RedBrowser - the first Trojan capable of infecting multiple mobile
platforms - was discovered. It was built by further extending Commwarrior's
functionality.
• In August 2010, AndroidOS.FakePlayer.a, a Trojan that exploited SMS services,
was discovered by Kaspersky Lab. AndroidOS.FakePlayer.a was the first SMS
malware to target the Android operating system.
• In 2011, DroidDream, a Trojan able to send sensitive data to remote servers and
install other applications without the user's consent, was discovered in the Google
Play Store. DroidDream infected more than 50 applications, causing damage to
hundreds of thousands of victims.
• In 2012, Boxer, a Trojan with behaviour similar to Commwarrior, was discovered.
Boxer used mobile country and network codes to distribute itself via text messages
in more than 60 countries.
• In 2013, FakeDefender, the first ransomware targeting Android was discovered.
2.3.2 Types of Malware
There are several types of malware like -
• Ransomware - Ransomware encrypts data, making it inaccessible and unavailable to
the user.
• Spyware - Spyware spies on Android devices and steals valuable information such as
passwords.
• Adware - Adware is a type of malware that shows continuous popups of advertise-
ments.
• Trojan - A Trojan enters the user's device disguised as a program that the user
has willingly downloaded.
• Virus - A virus is a type of malware which when executed infects other files by
inserting malicious code.
• Worm - Worms are malware that keep replicating themselves to infect other
devices. Worms do not need any user interaction to execute.
• Expander - Expander harms the user by inflating billing amounts.
• Backdoor - Backdoor is a piece of malicious code that allows unauthorized access to
the infected devices.
2.3.3 Malware Detection Approaches
Malware detection approaches can be categorised into three categories.
2.3.3.1 Invasion Detection
This method detects any attack or unauthorized access.
2.3.3.2 Misuse Detection
This method detects misuse by matching samples against stored signatures of known
malware. Advantages:
• Very good detection of known malware
• No false positives
Disadvantages:
• This method cannot detect even minimal variants of known malware
2.3.3.3 Anomaly Detection
This method detects malware based on behaviour: it learns patterns from the given
samples and tries to identify abnormal behaviour.
Advantages:
• This method can efficiently detect unseen and unknown malware.
Disadvantages:
• A large set of unique samples is needed to identify malicious behaviour.
• High false positive rate
There are two types of Anomaly detection techniques -
• Static Analysis - In static analysis, an application/program is analysed without
executing it. This technique can be applied to both the source code and the binary
file (in our case, the APK) to find security threats or malicious functionality
in an application.
• Dynamic Analysis - In dynamic analysis, the behaviour of an application is analysed
during its execution in an isolated environment. This method can capture malicious
behaviour that is not revealed by static analysis.
2.4 Tools
2.4.1 Androguard
Androguard [7] is a Python-based tool that is widely used for reverse engineering
Android applications and performing static analysis.
2.4.2 Spyder
Spyder [30] is a Python-based environment designed for editing, analysing and debugging
Python code.
2.5 Machine Learning Classifiers
Classification is the process of predicting a class or category of a sample. In the case of
binary classification, there are only two categories. Classifiers are the algorithms that do
the classification [38]. There are several classifiers like -
• Logistic Regression [20]
• Random Forest [25]
• Neural Network [24]
• Extra Tree [23]
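As a minimal sketch (on randomly generated data, not the thesis's datasets), the four classifiers listed above can be trained and compared on a binary classification task with scikit-learn; all parameter values here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for an (apps x features) matrix with binary labels.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                                    random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
}

# Fit each classifier and record its accuracy on the held-out split.
scores = {name: clf.fit(X_train, y_train).score(X_test, y_test)
          for name, clf in classifiers.items()}
```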
2.6 Related Work
• Wang et al. [34] extract used permissions and API calls from applications to train
a model using the AdaBoostM1 classifier. Evaluating their model on 1170 malware
samples and 1205 benign samples, they claim a true positive rate of 99.6%.
• Feizollah et al. [8] show the effectiveness of explicit and implicit Intents for An-
droid malware detection. The evaluation has been done on 5560 malware samples
and 1846 benign samples. They claim 91% detection accuracy (TPR) using Intents.
• Sun et al. [32] propose a permission-based model trained using SVM. They claim an
accuracy of 93.62% for malware in the dataset and 91.4% for unknown malware.
• Firdaus et al. [10] focus on system commands for the identification of root-malware
through static analysis. The evaluation has been done by selecting 550 malware
out of the 1260 malware available in the Malgenome dataset and 550 benign samples
downloaded from the Google Play Store. They claim an accuracy of 92.5%.
• Gaviria et al. [11] perform analysis on 1259 benign samples collected from the Google
Play Store and 1259 malware samples collected from the Android Genome Project,
targeting opcodes as features and using several machine learning classifiers. They
claim a precision of 96.758%.
• Using static analysis, DREBIN [3] gathered 8 types of features from 1,23,453 be-
nign samples and 5,660 malware samples and trained a model using SVM. DREBIN
claims to have outperformed several related approaches with a detection rate of 94%.
The type of features extracted by DREBIN are -
– Hardware Components
– Requested Permissions
– App Components
– Filtered Intents
– Used Permissions
– Suspicious API Calls
– Restricted API Calls
– Network Addresses
• ANASTASIA [9] extracts 6 types of features using static analysis. Then, with the
aim of building a promising model, the authors use several machine learning
classifiers such as AdaBoost, Random Forest, k-NN, Logistic Regression, etc. They
also use feature importances obtained from an Extra Trees classifier to discard
features below a given threshold value. Evaluating the performance of their framework
on 18,677 malware samples and 11,187 benign samples, they obtain a true positive
rate of 97.3% and claim results better than the state-of-the-art detection
methods for Android malware. The types of features extracted by ANASTASIA are -
– Intents
– Used Permissions
– System Commands
– Suspicious API Calls
– Malicious Activities
• Li et al. [16] propose a machine learning approach based on Factorization Machines.
They evaluate the model's performance on two malware datasets - DREBIN
and AMD [35].
– DREBIN : They collected 5560 malware samples from the DREBIN dataset
and 5600 benign samples from the Internet and extracted 7 types of features
from them, resulting in 93,324 features. They claim a precision of 99.91%.
– AMD : They also collected 24,553 malware samples from the AMD dataset
and 16,753 benign samples from the Internet and extracted 2,94,019 features.
They obtained a precision of 99.35% for this dataset.
They extract 7 types of features -
– Restricted APIs
– Suspicious APIs
– Used Permissions
– App Components
– Hardware Features
– Permissions
– Intent Filter
Summary: In the previous works we observe that the datasets are very small and un-
balanced. This skewness results in machine learning approaches producing poorly fitted
models and inflated values of the evaluation metrics on these datasets.
Also, we observe that the samples gathered are from a very narrow time window as shown
in Table 2.1. Therefore, the machine learning models trained on such datasets are unable
to generalize and sustain over time. This is because similar malware tend to appear
together in time. Also, the Android versions being targeted, the environment (DVM or
ART), the hardware specific components being targeted (like Bluetooth was more popular
in earlier years, WiFi is more popular now), and the mechanisms used (like SMS, emails)
vary greatly over time. In such a scenario, the machine learning models that are trained
on such limited datasets work with highly specific features and tend to over-fit for that
particular dataset (or time).
Therefore, to counter this, we work with multiple datasets that are spread out over a
period of January 2010 to February 2019. Unlike previous datasets, these are highly class
balanced and therefore it is expected that machine learning models should train better on
them. We have analyzed the instances occurring over time and have identified only the
relevant features - those that are the most informative and important for malware
detection across different years.
Dataset   Source                  Malware   Benign     Total      Malware Collection
1         Genome                  1170      1205       2375       August 2010 to October 2011
2         Drebin                  5560      1846       7506       August 2010 to October 2012
3         Drebin                  5494      5494       10,988     August 2010 to October 2012
4         Genome                  550       550        1100       August 2010 to October 2011
5         Genome                  1259      1259       2,518      August 2010 to October 2011
6         Drebin                  5560      1,23,453   1,29,013   August 2010 to October 2012
7         Genome + Drebin +
          M0Droid + VirusTotal    18,677    11,187     29,864     2009 - 2015
8         DREBIN                  5660      5660       11,320     August 2010 - October 2012
          AMD                     24,553    16,753     41,306     2010 - 2016
Table 2.1: Datasets for previous works
Chapter 3
Dataset
AndroZoo [1] is a growing repository that provides Android applications - both benign
and malware - for Android malware analysis. It contains more than 9 million applications
that have been analysed by multiple AV engines for labeling as malware or benign. For
our work, the samples that have been recognised as benign by all the AV engines are
considered benign. As for malware, the samples that have been recognised as malware by
at least 10 different AV engines are considered malware.
For an Android application, the dex date represents the application's creation date, i.e.
the last modification date of classes.dex (the last date when the app's code was built).
The dex date is stored in the dex file inside the APK archive. AndroZoo provides the
SHA256 of each application along with its dex date. We use the SHA256 to identify unique
samples and the dex date to categorise them by year.
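A minimal sketch of this labelling and year-wise categorisation; the sample tuples, truncated hashes, detection counts, and bucket names below are hypothetical stand-ins for AndroZoo metadata, not its actual schema:

```python
from datetime import datetime

# Hypothetical AndroZoo-style metadata rows: (sha256, dex_date, AV detections).
samples = [
    ("a1b2...", "2011-06-12 00:00:00", 0),   # flagged by no AV engine  -> benign
    ("c3d4...", "2014-03-05 00:00:00", 14),  # flagged by >= 10 engines -> malware
    ("e5f6...", "2017-11-20 00:00:00", 3),   # 1-9 detections -> left unused
]

def label(detections):
    """Apply the labelling rule: 0 detections -> benign, >= 10 -> malware."""
    if detections == 0:
        return "benign"
    if detections >= 10:
        return "malware"
    return None  # ambiguous samples are not used

def year_bucket(dex_date):
    """Map a dex date onto one of the three year-range datasets."""
    year = datetime.strptime(dex_date, "%Y-%m-%d %H:%M:%S").year
    if 2010 <= year <= 2012:
        return "D/2010-2012"
    if 2013 <= year <= 2015:
        return "D/2013-2015"
    return "D/2016-2019"

datasets = {}
for sha256, dex_date, detections in samples:
    cls = label(detections)
    if cls is not None:
        datasets.setdefault(year_bucket(dex_date), []).append((sha256, cls))
```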
In this work, the available set denotes the applications available in the AndroZoo
repository. We have collected these applications from the repository and categorised
them by dex date into sets of samples for both classes - malware and benign - to create
multiple datasets.
Other than AndroZoo, Android malware samples have also been gathered from DREBIN
and AMD. Each dataset, along with its source and characteristics, is explained in detail
in the following sections.
3.1 D/2010-2012
D/2010-2012 is a highly class-balanced dataset with a large number of samples from
both classes - malware and benign - for analysis. It contains a total of 3,87,236 samples
evenly distributed among the classes, with 1,93,612 malware samples and 1,93,624 benign
samples. Both the malware and benign samples have been collected by randomly selecting
them from the available set of applications in the AndroZoo repository with dex dates
between January 2010 and December 2012.
3.2 D/2013-2015
D/2013-2015 is also a highly class-balanced dataset, with a sample size of 3,70,294. It
contains 1,85,181 malware applications and 1,85,113 benign applications. All the samples
have dex dates between January 2013 and December 2015 and have been gathered via
random selection from the available set.
3.3 D/2016-2019
D/2016-2019 is again a highly class-balanced dataset containing more recent samples. It
contains 66,901 samples, of which 33,460 are benign and 33,441 are malware. The samples
from both classes have dex dates between January 2016 and February 2019. All the
malware applications present in the available set have been used, whereas the benign
applications have been randomly selected from the available set.
3.4 D/2010-2019
We combine all the samples from the above three datasets to create a final dataset
containing 8,24,431 samples, of which 4,12,234 are malware and 4,12,197 are benign. We
use this dataset to train and test our final model.
3.5 D/DREBIN
D/DREBIN is a dataset created by collecting a total of 11,048 samples. It contains
5,479 malware samples from the DREBIN dataset - a popular dataset for malware
analysis. The DREBIN dataset contains malware from 179 different families, collected
between August 2010 and October 2012. For the benign samples, we have used the
AndroZoo repository, gathering 5,569 samples by randomly selecting them from the
available benign samples. The dex dates of all the benign samples lie within the same
range as the malware samples.
3.6 D/AMD
D/AMD contains a total of 48,749 samples, of which 24,489 are benign and 24,260 are
malware. The malware samples have been collected from the AMD dataset - another
well-accepted and popular dataset among researchers for Android malware analysis,
containing more recent samples than the DREBIN dataset. The malware samples in the
AMD dataset were collected between 2010 and 2016. The benign set contains samples
randomly selected from the available set, all with dex dates between 2010 and 2016.
Table 3.1 shows the summary of the datasets.
Dataset #Malware #Benign #Total Year-Range
D/2010-2012 1,93,612 1,93,624 3,87,236 2010 - 2012
D/2013-2015 1,85,181 1,85,113 3,70,294 2013 - 2015
D/2016-2019 33,441 33,460 66,901 2016 - 2019
D/2010-2019 4,12,234 4,12,197 8,24,431 2010 - 2019
D/DREBIN 5,479 5,569 11,048 2010 - 2012
D/AMD 24,260 24,489 48,749 2010 - 2016
Table 3.1: Datasets Summary
Summary: A model built upon samples from a narrow time frame identifies features that
are important for detecting malware from a similar time frame. Similarly, a model trained
on a small set of samples can only detect similar malware. However, malware keeps
evolving over time, and models built by analysing small sample sets or samples from a
particular time frame will not identify such evolved and sophisticated malware. Therefore,
a large set containing samples from across the years is needed to build a sustainable
malware detection model.
We work with a large number of samples spread out over the period from 2010 to
February 2019. The size and time range of this set allow the analysis to cover many
variants of malware that would otherwise have been left out. Learning an estimator and
evaluating its performance on all known variants of malware is a necessary step towards
a generalised model that performs and sustains well.
Many features used in Android malware detection are time and size bound. Our analysis
shows that certain categories of features perform well on samples from a particular time
frame but poorly on samples from a different time frame. Similarly, some categories show
good performance on small sample sets compared to large ones, and vice versa. Therefore,
identifying categories that show good detection results on malware from different time
frames (different years) and on different sample set sizes is an important step towards
Android malware detection.
To address the above-mentioned issues, we have categorised the set of samples into three
categories based on the dex date. We analyse the performance of each category using
several machine learning classifiers on several datasets and different evaluation metrics.
These datasets are large and highly class balanced, and therefore machine learning models
are expected to train better on them.
Chapter 4
Framework Design
We propose a framework for extraction, analysis and refinement of raw data to output the
most relevant information to learn a model that is capable of efficient detection of malware
in the Android environment. The following sections discuss the proposed framework and
its modules in detail.
4.1 Framework
The proposed framework has three modules as shown in Fig 4.1. Each module is a separate
process and can be used independently.
• Information Extraction Module - The information extraction module is responsible for
unpacking/decompiling an Android application and extracting raw data from the
APK. The extracted information is then stored in the JSON file format. The information
extraction module is used by passing two arguments - the path to the directory
containing samples and the path to the directory to store the extracted data.
Section 4.2.1 discusses this module in more detail.
• Feature Engineering & Analysis Module - This module first constructs meaningful
features from the extracted data. Then it implements various feature selection
techniques and performs analysis using several machine learning classifiers. The
Feature Engineering & Analysis module is used by passing two arguments - the path
to the directory containing the JSON files and the path to store the final set of
features and the analysis results. Section 4.2.2 discusses the Feature Engineering
& Analysis module in depth.
• Detection Model - This module takes the path to a set of samples as its argument
and learns a final malware detection model based on those samples for fast, accurate
and efficient malware detection. Section 4.2.3 discusses this module in more detail.
Figure 4.1: The Proposed Framework - three modules (Information Extraction, Feature
Engineering & Analysis, Learning Model) that output the final model
The proposed framework has the following advantages over existing state-of-the-art
malware detection approaches.
• Light Weight - This framework is built on a static analysis approach integrated with
highly effective feature engineering techniques, resulting in a very lightweight model
capable of efficiently detecting malware.
• Generalized and Sustainable - Several large and highly class-balanced datasets have
been used for the analysis. These datasets contain malware and benign samples from
throughout the years, including samples as recent as 2019. This wide range covers
the varieties of polymorphic, sophisticated and evolving malware that have been
discovered over the years.
• Better Detection Performance - Several feature selection techniques have been
implemented that reduce noise in the data which would otherwise have led to poor
model performance.
• Scalable and Efficient - The proposed framework implements static analysis tech-
niques with the help of a reverse engineering tool - Androguard - which, unlike
dynamic analysis techniques, takes relatively little time. Therefore the analysis
can be performed on a large number of samples, which is important for facing the
rapidly growing number of malware samples.
• Novel Set of Features - The feature set used in this work is effective in the detection
of Android malware. We demonstrate the detection results of each category of
features using several classifiers in section 6.3.1.
• Less Computation Time and Memory Usage - The proposed framework implements
several highly effective feature selection techniques that output only the relevant
and important set of features. Therefore, for analysing an Android application, only
these features need to be extracted and stored, which saves a lot of time and
memory.
4.2 Framework Modules
The following subsections describe each of the modules in depth.
4.2.1 Information extraction module
Among other things, a typical Android application contains application code, resources,
assets, certificates, and a manifest file, all packaged into the APK. Using reverse
engineering tools like Androguard, apkparser, etc., an abundant amount of information
can be extracted and analysed from these files. The main features for this work are
extracted from the manifest file and the dex code. The proposed framework extracts
several categories of information from these two files, as shown in Fig 4.2.
Figure 4.2: Information Extraction Module
The first argument to the Information Extraction module is the path to the set of samples.
Each sample in the dataset is loaded and passed to Androguard for reverse engineering.
Using Androguard, the Android application is first unpacked and decompiled, and then
the manifest file and the dex code are retrieved to extract information like permissions,
APIs, opcodes, system commands, etc. from each sample. The extracted information is
dumped into the directory provided as the second argument while running this module.
Each JSON file contains a dictionary of key-value pairs, where each key is the type of
information extracted (permission, API, activity, intent, etc.) and the value is all the
relevant information extracted in that category.
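The storage format described above can be sketched with Python's standard json module; the category keys, values, and file name here are illustrative, not the framework's exact output:

```python
import json
import os
import tempfile

# Illustrative extracted data for one APK: each key is an information
# category; each value is the list of items extracted in that category.
extracted = {
    "permission": ["android.permission.INTERNET", "android.permission.SEND_SMS"],
    "api": ["Landroid/telephony/SmsManager;->sendTextMessage"],
    "activity": ["com.example.MainActivity"],
    "intent": ["android.intent.action.BOOT_COMPLETED"],
}

# One JSON file per sample, named here (hypothetically) after its hash.
with tempfile.TemporaryDirectory() as out_dir:
    path = os.path.join(out_dir, "sample_sha256.json")
    with open(path, "w") as f:
        json.dump(extracted, f, indent=2)
    # The feature engineering module would later reload the dictionary.
    with open(path) as f:
        loaded = json.load(f)
```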
4.2.2 Feature Engineering & Analysis Module
This module performs an in-depth analysis of the feature sets as shown in Fig 4.3. The
following subsections discuss each process in this module in detail.
Figure 4.3: Feature Engineering & Analysis Module
4.2.2.1 Feature Construction/Representation
The data extracted from the applications is stored in files as dictionaries whose values
are lists of strings. It needs to be converted into a format that represents meaningful
information and that machine learning classifiers can consume.
Due to the high dimensionality of the features, we first represent the data in sparse
matrix format. Then, after the data cleaning step of the feature selection process, we
represent it in dense matrix format. Section 5.1 discusses the details of the feature
representation.
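A small SciPy sketch of the two representations; the toy matrix is illustrative:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy binary feature matrix: 3 apps x 5 features, mostly zeros -- the
# typical shape after one-hot encoding the extracted strings.
rows = np.array([0, 0, 1, 2])   # app indices of the non-zero entries
cols = np.array([0, 3, 1, 4])   # feature indices of the non-zero entries
vals = np.ones(4)
X_sparse = csr_matrix((vals, (rows, cols)), shape=(3, 5))

# After data cleaning prunes most columns, a dense array becomes practical.
X_dense = X_sparse.toarray()
```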
4.2.2.2 Feature Selection
We often have thousands of features, and machine learning is challenged by such large
feature sets. Many of these features are irrelevant or redundant, which increases the
complexity of the model. Therefore, we need to consider only the most important and
relevant features. The benefits of building a model with such features are -
• It reduces the computational cost and time for training a model.
• It reduces the complexity of a model and makes it easier to interpret.
• It reduces the variance of the model and therefore over-fitting.
• It improves the accuracy of a model if the proper features are chosen.
Feature selection methods are often used to solve such problems [15]. Fig 4.4 shows the
feature engineering module for this work. This module has five phases, listed below and
discussed in subsequent sections.
• Data Cleaning
• Feature Elimination Based on Frequency of Usage
• Identification of Saturation Point using RFECV
• Extracting Optimal Set of Features using RFE
• Feature Selection Based on Correlation between Features
Figure 4.4: Feature Selection Process
Data Cleaning
In the data cleaning step, we search for a final threshold value (by varying c_init) using
Equation 4.1, such that there is a significant drop in the number of rarely used features.
We discard those features if they do not contribute to any improvement in detection
performance. Section 5.2 discusses the implementation details of the data cleaning step.
threshold_val = c_init (4.1)
Feature Elimination Based on Frequency of Usage
This step is an extension of the data cleaning step. For each feature in a feature set, we
look at its frequency of usage by malware as well as benign applications (how many times
the feature has been used by malware and how many times by benign) and construct
multiple c values. These c values serve as variables in finding the threshold values for
malware and benign applications. Features whose frequency of usage is below the
threshold in both malware and benign, and whose difference in frequency of usage
between malware and benign is less than c/2, are discarded. For every c, we follow this
step to reduce features and evaluate the model performance. The feature set at the
saturation point in the results, after which there is no significant improvement in
performance, is taken as the reduced set of features. The implementation details for this
process are explained in Sec 5.3.
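A simplified sketch of this elimination rule on toy binary matrices; the exact way the thresholds are derived from c is an assumption for illustration:

```python
import numpy as np

def frequency_filter(X_mal, X_ben, c):
    """Return indices of features that survive the frequency rule.

    X_mal, X_ben: binary matrices (samples x features), one per class.
    A feature is discarded if its usage count falls below a c-derived
    threshold in BOTH classes AND the malware/benign counts differ by
    less than c/2 (i.e. it is rare and non-discriminative).
    """
    freq_mal = X_mal.sum(axis=0)
    freq_ben = X_ben.sum(axis=0)
    thr_mal = c  # illustrative: thresholds derived directly from c
    thr_ben = c
    rare = (freq_mal < thr_mal) & (freq_ben < thr_ben)
    similar = np.abs(freq_mal - freq_ben) < c / 2
    keep = ~(rare & similar)
    return np.where(keep)[0]

# Toy data: feature 0 is common in malware; feature 1 is rare and
# equally used by both classes, so it gets discarded.
X_mal = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 0]])
X_ben = np.array([[0, 0, 0], [0, 1, 1], [0, 0, 0]])
kept = frequency_filter(X_mal, X_ben, c=2)
```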
RFECV - Recursive Feature Elimination with Cross Validation
RFECV [27], as the name suggests, is the process of recursively eliminating features using
feature ranking, where the ranking is based on feature importance, followed by a
cross-validated selection of the best number of features. We use Extra Trees as the base
estimator for RFECV. RFECV takes a classifier C, a scoring function F and the number
of features k to eliminate in every step as input parameters. It starts with all the
features and computes the performance score and the feature ranking based on
importance. It then eliminates the k lowest-ranked features and recomputes the
predictions, the performance score and the feature ranking. The iterations stop when
there is no significant improvement in the results. An accuracy vs. number-of-features
graph is generated, and the number of features to use is chosen from this graph as the
point beyond which there is no significant improvement in results.
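A scikit-learn sketch of RFECV with an Extra Trees base estimator on synthetic data; the parameter values are illustrative, not the thesis's settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for the (apps x features) matrix.
X, y = make_classification(n_samples=200, n_features=25, n_informative=5,
                           random_state=0)

selector = RFECV(
    estimator=ExtraTreesClassifier(n_estimators=25, random_state=0),
    step=1,              # k features eliminated per iteration
    cv=3,                # cross-validated selection of the best feature count
    scoring="accuracy",
)
selector.fit(X, y)

# The saturation point read off the cross-validated accuracy curve.
best_n = selector.n_features_
```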
RFE - Recursive Feature Elimination
RFE [26] is a process for choosing the best set of features in the data. It reduces
features by recursively considering smaller sets of features each time until the desired
number of features is reached. Starting with the initial feature set, an estimator is
trained on these features to calculate the importance of each feature. Then, depending
on the step parameter (say step = n), the n least important features are pruned from the
current set. This process is repeated on the pruned set until the reduced set is the
same size as the desired number of features. Using RFE, we reduce the feature set to
an optimal subset of features based on the result gathered from RFECV.
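A scikit-learn sketch of RFE on synthetic data; the target of 8 features stands in for the count suggested by RFECV and is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=200, n_features=25, n_informative=5,
                           random_state=0)

# Prune the 2 least important features per step until 8 remain.
rfe = RFE(estimator=ExtraTreesClassifier(n_estimators=25, random_state=0),
          n_features_to_select=8, step=2)
rfe.fit(X, y)
X_reduced = rfe.transform(X)   # matrix restricted to the selected features
```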
Feature Reduction Based on Correlation between Features
We remove features that are dependent on each other using a correlation matrix.
Correlation measures the linear relationship between two variables [22]. Two variables
that are linearly dependent have a higher correlation than two variables that are
non-linearly dependent, so features with high correlation are more linearly dependent.
Such features have almost the same effect on the dependent variable; therefore, when
two features are highly correlated, we can drop one of them.
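A small pandas sketch of this correlation-based pruning; the 0.95 cutoff and the feature names are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "f1": a,
    "f2": a * 2 + rng.normal(scale=0.01, size=200),  # near-duplicate of f1
    "f3": rng.normal(size=200),                      # independent feature
})

corr = df.corr().abs()
# Scan the upper triangle so each pair is inspected once, and drop one
# feature from each highly correlated pair.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
```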
4.2.2.3 Analysis
This step draws general inferences from the proposed techniques and results. Starting
with the initial set of features, at each step of feature reduction we have analysed the
model performance by varying several factors such as evaluation metrics, learning
parameters, threshold values, etc. For a few steps, we have also analysed the results
using 5-fold cross-validation to generalise the results and avoid over-fitting. All the
analysis has been done on multiple datasets. Section 6.3.1 shows the trends and results
of the analysis of several feature sets.
4.2.3 Learning Detection Model Module
The first two modules of the framework analyse the extracted data and the proposed
techniques for better detection, and output the results. Based on the analysis and
results from the first two modules, this module identifies the important categories of
features, learns several models based on combinations of different categories (like API
Package + Requested Permissions, or Opcodes + System Commands + Requested
Permissions), and analyses the performance of each of those models. Finally, a
lightweight detection model is learned, based on only the relevant and meaningful
information extracted from the applications. We demonstrate the performance of the
final model in section 6.3.2 by testing it on multiple test sets. We also demonstrate the
effectiveness of the identified relevant features by learning two models, on the D/AMD
and D/DREBIN datasets respectively, based on these relevant features, and present their
performance using several evaluation metrics.
Summary: This chapter deals with challenges in Android malware detection - such as
the curse of dimensionality, noise, and skewness - by proposing a framework capable of
efficient, fast, accurate and generalised malware detection. The framework has several
modules and steps/processes within those modules. It performs analysis and outputs
results using multiple datasets to give a robust model.
Chapter 5
Methodology
This chapter provides an insight into the features constructed for this work, the way they
have been represented, their size and properties. It also shows trends in features related to
frequency of usage, their importance with respect to other features and their contribution
in detection of malware.
5.1 Features
The information extracted from the applications is in the form of lists and dictionaries.
We need to convert it into meaningful features and represent them appropriately for use
with machine learning techniques. This section discusses the feature construction
process, the representation, types and size of the features, and the types of values they
hold.
5.1.1 Feature Construction
In this study, we use the term feature construction to describe the conversion of raw
data into meaningful features that can be used with machine learning techniques to train
models. The following subsections describe the types of information that have been
extracted for use as features.
5.1.1.1 Permissions
The use of certain combinations or settings of permissions often reflects malicious
behaviour. Therefore, we have extracted two sets of permissions to identify harmful
behaviour.
• Requested Permissions – All permissions (Android-defined and third-party) requested
by an application must be declared in the manifest file. We have retrieved all these
requested permissions for use as features.
• Used Permissions – When an application requests some resource, the package manager
checks whether the required permission has been granted. All such permissions
declared in the manifest file are used permissions. The list of used permissions is
taken as features.
Along with the individual permissions, we have also taken the total number of requested
permissions, the number of AOSP permissions, the number of third-party permissions and the
total number of permissions used by an application as features.
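The count-based permission features above can be sketched as follows. This is a minimal illustration, not the thesis's extraction code; the permission lists are made-up examples, and AOSP permissions are identified simply by the `android.permission.` prefix.

```python
# Sketch: derive the count-based permission features described above.
# The permission lists below are illustrative, not taken from a real APK.

AOSP_PREFIX = "android.permission."

def permission_counts(requested, used):
    """Return the four count features built from an app's permissions."""
    n_aosp = sum(1 for p in requested if p.startswith(AOSP_PREFIX))
    return {
        "num_requested": len(requested),
        "num_aosp": n_aosp,
        "num_third_party": len(requested) - n_aosp,
        "num_used": len(used),
    }

requested = ["android.permission.INTERNET",
             "android.permission.SEND_SMS",
             "com.example.perm.CUSTOM"]        # a third-party permission
used = ["android.permission.INTERNET"]
print(permission_counts(requested, used))
```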
5.1.1.2 API
The use of certain packages, classes and methods by an application can indicate suspicious
behaviour. Therefore we have also considered APIs as features. There are three types of
API related features -
• API : For each method in a class belonging to an Android defined package, we have
built a string to represent its use. This list of strings has been taken as features [18]
[17].
• API Package : These represent the Android defined packages used by the application.
• Restricted API calls : These represent API calls for which the required permission
has not been requested. Use of such API calls generally implies some malicious
behaviour.
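The restricted API check above can be sketched as a set difference between the permissions an API call requires and the permissions the app actually requested. The API-to-permission mapping below is a toy stand-in (in practice such a mapping comes from resources like PScout or axplorer); the method names are illustrative.

```python
# Sketch: flag "restricted" API calls, i.e. calls whose required permission
# was never requested in the manifest. The mapping here is a toy example.

API_PERMISSION_MAP = {
    "Landroid/telephony/SmsManager;->sendTextMessage":
        "android.permission.SEND_SMS",
    "Landroid/location/LocationManager;->getLastKnownLocation":
        "android.permission.ACCESS_FINE_LOCATION",
}

def restricted_calls(api_calls, requested_permissions):
    """Return the API calls whose required permission was not requested."""
    requested = set(requested_permissions)
    return [call for call in api_calls
            if call in API_PERMISSION_MAP
            and API_PERMISSION_MAP[call] not in requested]

calls = list(API_PERMISSION_MAP)
print(restricted_calls(calls, ["android.permission.SEND_SMS"]))
```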
5.1.1.3 Application Components
Each application component defines either a user interface or an interface to the system.
We have considered all of them as features.
• Activity
• Service
• Content Providers
• Broadcast Receivers
The total numbers of activities, services, providers and receivers have also been taken as features.
5.1.1.4 Intents
Malware often listens for intents. Therefore we can also use intents to identify malicious
behaviour. Three sets of intents have been taken as features.
• Intent Filter : The intents present in the manifest file.
• Intent Const [31] : The categories of the intents extracted from the dex file.
• Intent Objects : All the messaging objects in the dex file through which actions
from another application component are requested.
We have also taken the total counts of intent filters, intent consts and intent objects
in an application as features.
5.1.1.5 System Commands
When an attacker gains root privileges on the system, they can execute several commands
that cause harm. Therefore patterns in the usage of system commands can also help to
identify malicious behaviour [29].
5.1.1.6 Opcodes
Identifying patterns in the usage of opcodes, extracted from the classes.dex file, can
also help in Android malware detection.
5.1.1.7 Misc Features
The presence of native code, dynamic code loading, reflection, crypto code, the total number
of calls in the recording category, the camera category etc. can also help in identifying
malicious behaviour.
5.1.2 Feature Representation
For each category, we have built a binary feature matrix from the collected information,
where 1 indicates that the application contains a particular item and 0 indicates
otherwise. We have represented the feature matrix in sparse format using CountVectorizer.
For features in the API, API Packages, System Commands and Opcodes categories, we take the
frequency of use of the feature by an application as the value in the reduced feature-set
matrix. The initial feature matrix has been represented in row-sparse format to save
memory. For the reduced set of features obtained after the feature selection step, a dense
format has been used.
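The binary sparse representation described above can be sketched with scikit-learn's CountVectorizer: each application becomes a space-separated string of the items it contains, and `binary=True` yields a 0/1 row-sparse (CSR) matrix. The feature names are illustrative.

```python
# Sketch of the binary sparse representation using CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

apps = [
    "SEND_SMS INTERNET READ_CONTACTS",   # app 1: three items
    "INTERNET",                          # app 2: one item
]
# analyzer=str.split keeps each item whole instead of tokenizing on "_" etc.
vectorizer = CountVectorizer(analyzer=str.split, binary=True)
X = vectorizer.fit_transform(apps)       # scipy CSR (row-sparse) matrix

print(sorted(vectorizer.vocabulary_))    # column names in column order
print(X.toarray())                       # dense view of the 0/1 matrix
```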
5.1.3 Types of Feature sets
There are two types of feature sets.
• Categorical Features : Categories that contain lists of strings as values.
• Miscellaneous Features : Categories that contain numerical values, usually counts.
Table 5.1 shows the list of feature sets and their type.
Categorical Features:
Requested Permissions, Used Permissions, API, API Packages, Restricted API,
Intent Filters, Intent Objects, Intent Const, System Commands, Opcodes,
activity, service, receiver, providers

Miscellaneous Features:
num activity, num service, num receiver, num providers, num intent-filters,
num intent objects, num intent consts, Num Aosp Permissions, Num Third party
Permissions, Num Requested Permissions, count binary category, count dynamic
category, count crypto category, count network category, count gps category,
count os category, count io category, count recording category, count telephony
category, count accounts category, count bluetooth category, count nfc category,
count display category, count content, count context, count database, count
reflection category, count sms category, entropy rate, ascii obfuscation,
target sdk, eff target sdk, min sdk, max sdk, reflection code, native code,
dynamic code, APK is signed

Table 5.1: List of Feature Sets
5.1.4 Feature size
Each feature set contains different number of features varying from several dozens to sev-
eral lac’s. There are many features that are specific to some applications. Often it results
in a large vector space. For features like API packages, system commands etc, we have
used Android defined names which leads to smaller vector space.
Table 5.2 shows the total number of features in each feature set for three datasets.
Feature set/Dataset D/2010-2012 D/2013-2015 D/2016-2019
Activity 12,53,259 13,61,756 4,15,243
Service 88,575 1,25,578 41,994
Broadcast Receiver 90,081 1,14,134 33,115
Content Provider 9952 10,265 3833
Intent Filters 82,281 85,948 34,200
Intent Objects 1,19,570 1,30,259 33,461
Intent Const 14,409 17,226 4816
Requested Permissions 15,162 39,501 14,748
Used Permissions 60 60 55
Restricted API 1874 1740 398
API 38,747 52,245 68,113
API packages 179 212 232
System Commands 181 193 175
Opcodes 222 222 224
Misc 42 42 42
Total 17,14,594 19,39,381 6,50,649
Table 5.2: Number of Features in each Feature set
A model built by gathering as much information as possible helps the classifier learn
more. However, there are several disadvantages to using a large number of features to
build such models.
• A large amount of resources is needed to deal with a large number of features.
• The processing time to analyse these features increases.
• With an increase in the number of features, the computation power needed to build a
model also increases.
Such models are not feasible, scalable or efficient. Therefore we need a mechanism by
which a classifier can be trained using fewer features while maintaining similar
detection results or improving them.
5.2 Data Cleaning
In Android malware analysis, we often encounter two types of strings - Android defined
and custom/third-party defined. Android defines a list of permissions in the AOSP (Android
Open Source Project). However, any app developer can define their own set of permissions
to include in an application. Similarly, an activity or a service can be given any name
depending on the developer's choice. This results in a feature being exclusive to very few
Android applications, or sometimes even one application. These custom defined strings may
or may not be informative for the detection of Android malware. If, for a feature, the
value in most of the samples is zero, the feature is rarely used or not used at all by the
applications. Such features contribute little information compared to features that are
used a lot. Likewise, when a variable has a constant value across all samples in the
dataset, it does not improve the power of the model because it has little variance.
Therefore, if a feature is used rarely, or very few times compared to the number of
samples present, it can be dropped in favour of others. We therefore propose a data
cleaning step that eliminates a feature based on how many times it is used in the dataset.
This filtering helps to reduce noise in the dataset as well as discard less informative
features.
Keeping 1 as the minimum threshold means we keep a feature if it is used by at least one
sample; taking 2 as the threshold means a particular feature must be used by at least 2
samples in the whole dataset, otherwise we discard it. Similarly, a threshold of 10 means
a feature should be used by at least 10 samples.
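The cleaning rule above can be sketched on a plain 0/1 matrix: count how many samples use each feature (column) and keep only the columns that meet the threshold. The matrix below is a toy example; the real pipeline operates on the sparse matrices described earlier.

```python
# Sketch of the cleaning step: drop every feature (column) that is used by
# fewer than c_init samples.

def clean_features(matrix, c_init):
    """Return (kept column indices, reduced matrix)."""
    n_cols = len(matrix[0])
    usage = [sum(row[j] for row in matrix) for j in range(n_cols)]
    keep = [j for j in range(n_cols) if usage[j] >= c_init]
    return keep, [[row[j] for j in keep] for row in matrix]

X = [[1, 0, 1, 0],
     [1, 0, 0, 0],
     [1, 0, 1, 0]]
kept, X_clean = clean_features(X, 2)
print(kept)        # columns used by at least 2 samples
print(X_clean)
```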
To find the final threshold value, we increase the c_init parameter value by one at every
step and look at the resulting number of features. The c_init value after which there is
no significant drop in the number of features is taken as the final threshold value. We
also evaluate the model's performance at every step after discarding features, to ensure
similar detection results between the model trained on the reduced set of features and the
model trained on the initial set of features.
There is a huge drop in the number of features for categories like activity, service,
receiver, provider, intent filter, intent object, intent const, requested permissions and
API, whereas used permissions, restricted API, Android API packages, system commands and
opcodes do not show any significant drop in the number of features. Fig 5.1 shows the
trend of the drop for some categories.
Figure 5.1: Drop in Number of Features in Feature sets
5.3 Frequency of Usage
This step also relies on the usage count of a feature for the elimination process. For
each categorical feature set, we separate the samples according to their labels, malware
and benign, into two sets and find the corresponding threshold values (threshold_ben and
threshold_mal). We calculate threshold_ben using equation 5.1 and threshold_mal using
equation 5.2. Given a feature, if it is used less than threshold_mal times in the malware
set and less than threshold_ben times in the benign set, and the difference in its
frequency of usage between the two sets is less than half of the current c value chosen,
we drop it. We argue that such a feature is less likely to provide useful information to
distinguish between malware and benign samples compared to other features that are used
frequently. For example, suppose we have 1000 malware and 1000 benign samples. If a
feature is used once by malware and twice by benign, it can be considered for elimination,
since it does not help classification. However, to avoid losing meaningful features, such
as a feature used once in the benign set and 999 times in the malware set, we also look at
the difference in the frequency of usage.
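The elimination rule just described can be sketched as a small predicate. The thresholds follow eqs. 5.1 and 5.2 (threshold = number of samples / c); the numbers in the example mirror the 1000-malware/1000-benign scenario above, with c = 100 chosen arbitrarily for illustration.

```python
# Sketch of the frequency-based elimination rule: drop a feature only when
# it is rare in BOTH classes AND its malware/benign usage counts are close.

def should_drop(count_mal, count_ben, n_mal, n_ben, c):
    threshold_mal = n_mal / c            # eq. 5.2
    threshold_ben = n_ben / c            # eq. 5.1
    rare_in_both = count_mal < threshold_mal and count_ben < threshold_ben
    similar_usage = abs(count_mal - count_ben) < c / 2
    return rare_in_both and similar_usage

# 1000 malware / 1000 benign samples, c = 100 -> thresholds of 10 uses each.
print(should_drop(1, 2, 1000, 1000, c=100))     # rare and similar: drop
print(should_drop(999, 1, 1000, 1000, c=100))   # large difference: keep
```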
To construct candidate values for the c parameter, we create a list from the frequencies
of usage of all the features in the malware set and in the benign set. We then filter out
values for which there is no drop in the number of features compared to the previous c
value, to create a final list. These values are used to evaluate the model's performance,
from which an accuracy vs number of features graph is plotted. The c value where the
improvement saturates is chosen as the required c value. This c value is different for
each categorical set, which is one of the benefits of having categorical feature sets.
During c-parameter tuning, we have used the Extra Trees classifier as the base estimator
to evaluate model performance. Figs 5.2, 5.3 and 5.4 show the accuracy vs number of
features graphs for the API Packages, System Commands and Requested Permissions categories
respectively.
threshold_ben = #BenignSamples / c        (5.1)

threshold_mal = #MalwareSamples / c       (5.2)
Figure 5.2: Accuracy vs Number of Features plot for API Packages based on frequency of usage. Panels: (a) D/2010-2012, (b) D/2013-2015, (c) D/2016-2019.
Figure 5.3: Accuracy vs Number of Features plot for System Commands based on frequency of usage. Panels: (a) D/2010-2012, (b) D/2013-2015, (c) D/2016-2019.
Figure 5.4: Accuracy vs Number of Features plot for Requested Permissions based on frequency of usage. Panels: (a) D/2010-2012, (b) D/2013-2015, (c) D/2016-2019.
5.4 RFECV
Using accuracy as the scoring function, we plot a graph of accuracy vs number of features
using RFECV. We have set the step parameter to 1, which means RFECV discards one feature
at a time and evaluates the model's performance; the feature discarded at each step is the
one with the least importance relative to the others. The following call shows how the
RFECV function has been used.
• selector = RFECV(estimator, step=1, cv=StratifiedKFold(5), scoring="accuracy")
Here, the Extra Trees classifier with 20 estimators and a random state of 84 has been used
as the base estimator.
• estimator = ExtraTreesClassifier(n_estimators=20, random_state=84, n_jobs=-1)
We have also used several attributes of RFECV to help us with the analysis.
• selector.support_
• selector.grid_scores_
• selector.estimator_
After a thorough analysis, we identify a saturation point for accuracy beyond which there
is no significant improvement in detection results. Figs 5.5, 5.6, 5.7 and 5.8 show the
accuracy vs number of features graphs for Requested Permissions, API Packages, Intent
Const and Opcodes respectively.
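The RFECV setup above can be sketched end-to-end on a small synthetic dataset. The dataset (and its sizes) is a placeholder standing in for the real feature matrices; the estimator and RFECV parameters mirror those listed above.

```python
# Sketch of the RFECV setup with the thesis's parameters, run on a small
# synthetic dataset in place of the real feature matrices.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=200, n_features=12, n_informative=4,
                           random_state=84)
estimator = ExtraTreesClassifier(n_estimators=20, random_state=84, n_jobs=-1)
selector = RFECV(estimator, step=1, cv=StratifiedKFold(5), scoring="accuracy")
selector.fit(X, y)

print(selector.n_features_)      # optimal number of features found
print(selector.support_.sum())   # size of the selected-feature mask
```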
5.5 RFE
After identifying the optimal number of features from the RFECV plot, we use RFE to reduce
the initial set of features to the desired number of features (say x). The RFE function
has been used with the following parameters.
• selector = RFE(estimator, n_features_to_select=x, step=1)
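The RFE step can be sketched in the same way; here x = 5 is an arbitrary example value standing in for the number read off the RFECV plot, and the synthetic dataset again replaces the real feature matrices.

```python
# Sketch of the RFE step: reduce the feature set to the x features chosen
# from the RFECV plot (x = 5 here is an arbitrary example value).
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=200, n_features=12, n_informative=4,
                           random_state=84)
estimator = ExtraTreesClassifier(n_estimators=20, random_state=84, n_jobs=-1)
x = 5
selector = RFE(estimator, n_features_to_select=x, step=1)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)
```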
Figure 5.5: Accuracy vs Number of Features plot for Requested Permissions using RFECV (y-axis: cross-validation score, i.e. fraction of correct classifications). Panels: (a) D/2010-2012, (b) D/2013-2015, (c) D/2016-2019.
Figure 5.6: Accuracy vs Number of Features plot for API Packages using RFECV (y-axis: cross-validation score). Panels: (a) D/2010-2012, (b) D/2013-2015, (c) D/2016-2019.
Figure 5.7: Accuracy vs Number of Features plot for Intent Const using RFECV (y-axis: cross-validation score). Panels: (a) D/2010-2012, (b) D/2013-2015, (c) D/2016-2019.
Figure 5.8: Accuracy vs Number of Features plot for Opcodes using RFECV (y-axis: cross-validation score). Panels: (a) D/2010-2012, (b) D/2013-2015, (c) D/2016-2019.
5.6 Learning Final Model
We have combined the train and evaluation splits of the D/2010-2012, D/2013-2015 and
D/2016-2019 datasets and trained the final model, which detects Android malware over time.
It is to be noted that at no point in the process did we look at the test split of any
dataset. This final model is trained using the 297 identified features that are critical
and relevant over the years. The performance of this model is shown in section 6.3.2. We
report the performance of our model using different metrics, such as accuracy, precision,
recall and the confusion matrix, to demonstrate the model's robustness.
5.7 Effectiveness of Identified Features
We have analyzed the instances occurring over time and identified the relevant features,
i.e. the features that are the most informative and important for malware detection across
different years. These features can be used with any set of samples to train a model for
proper classification. To show the effectiveness of the identified features, we train two
models, using the train sets of the D/DREBIN and D/AMD datasets, based on the identified
set of features, and evaluate the performance of the models on the unseen test sets of
the D/DREBIN and D/AMD datasets respectively.
Summary: This chapter provides the implementation details for the various modules and the
processes and steps within the modules. It also explains the reasoning behind the
performed operations and provides the parameters used for the various functions.
Chapter 6
Experimentation and Results
This chapter discusses the experimental setup, the analysis results and the final model results.
6.1 Experimental setup
This section discusses the system configuration, the parameters of the machine learning
classifiers and the dataset distribution.
6.1.1 System
The dataset download and code development have been done on a machine with the
configuration Ubuntu 16.04, 16 GB RAM, 7 TB storage and 8 cores. All other processes have
been run on a system with the configuration Ubuntu 16.04.6, 32 GB RAM, 1 TB HDD + 128 GB
SSD and 24 cores.
6.1.2 Machine Learning Classifiers
The following classifiers, with the given parameters, have been used for training the
models during the analysis phase. Based on the performance of each classifier, shown in
section 6.3.1, we choose the Extra Trees classifier as the best one for further in-depth
analysis and for training the final model.
• LogisticRegression : LogisticRegression(random_state=84)
• RandomForest : RandomForestClassifier(n_estimators=20, random_state=84)
• NeuralNetwork : neural_network.MLPClassifier(hidden_layer_sizes=3)
• ExtraTrees : ExtraTreesClassifier(n_estimators=20, random_state=84, n_jobs=-1)
All the parameters for these classifiers were chosen arbitrarily, without looking at the
evaluation results and without parameter tuning. However, to train the final model, we
tune the n_estimators parameter of the Extra Trees classifier.
6.1.3 Dataset Partition
To analyse the proposed framework and to measure model performance, we have split the
datasets into 3 parts - a train set, an evaluation set and a test set.
• Train set - The set of samples used to train/fit the model.
• Evaluation set - The set of samples on which frequent evaluation of the trained model
has been performed to improve model performance.
• Test set - The set of samples used to obtain an unbiased measure of the final model's
performance.
The D/2010-2012, D/2013-2015 and D/2016-2019 datasets have been split in a 0.60 : 0.20 :
0.20 ratio to obtain the train set, the evaluation set and the test set respectively. We
have used the random.sample() function from the random package in Python to obtain the
samples for the test set (20% of the samples). Then, the train_test_split function (from
scikit-learn) with 20 as the random state has been used to split the remaining samples in
a 0.75 : 0.25 ratio into the train and evaluation sets.
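The two-step 60/20/20 partition can be sketched as follows; the integer "samples" and the seed values are placeholders for the real APK sample lists (a 20% holdout followed by a 0.75/0.25 split of the remainder yields the 60/20/20 ratio).

```python
# Sketch of the 60/20/20 partition: hold out 20% as the test set first,
# then split the remainder 0.75/0.25 into train and evaluation sets.
import random
from sklearn.model_selection import train_test_split

samples = list(range(1000))                       # stand-in for APK samples
rng = random.Random(20)
test = rng.sample(samples, k=len(samples) // 5)   # 20% test set
test_ids = set(test)
rest = [s for s in samples if s not in test_ids]
train, evaluation = train_test_split(rest, test_size=0.25, random_state=20)

print(len(train), len(evaluation), len(test))
```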
We have combined the train and evaluation sets of the D/2010-2012, D/2013-2015 and
D/2016-2019 datasets to create a set for D/2010-2019. This set has been split 0.75 : 0.25
to obtain the train and evaluation sets respectively, using the same train_test_split
function. The test sets of the D/2010-2012, D/2013-2015 and D/2016-2019 datasets have
been used as the test sets for the D/2010-2019 dataset.
The D/AMD dataset has been split in a 0.75 : 0.25 ratio, where 75% of the samples have
been used as the train set and the remaining 25% as the test set. The D/AMD dataset does
not have an evaluation set.
The D/DREBIN dataset has also been split in a 0.75 : 0.25 ratio, where 75% of the samples
have been used as the train set and the remaining 25% as the test set.
Table 6.1 shows the summary of the partition.
Dataset        Train      Evaluation   Total (Train+Eval), Mal + Ben    Test, Mal + Ben
D/2010-2012    2,32,342   77,448       3,09,790 (1,54,890 + 1,54,900)   77,446 (38,722 + 38,724)
D/2013-2015    2,22,177   74,059       2,96,236 (1,48,145 + 1,48,091)   74,058 (37,036 + 37,022)
D/2016-2019    40,141     13,381       53,522 (26,753 + 26,769)         13,379 (6,688 + 6,691)
D/2010-2019    4,94,660   1,64,888     6,59,548 (3,29,788 + 3,29,760)   1,64,883 (82,446 + 82,437)
D/AMD          36,561     -            36,561 (18,256 + 17,695)         12,188 (6,004 + 6,184)
D/DREBIN       8,286      -            8,286 (4,111 + 4,175)            2,762 (1,368 + 1,394)

Table 6.1: Partition of Datasets
6.2 Evaluation Metrics
To evaluate model performance during the analysis phase and to demonstrate the final
model's performance, we have used several popular evaluation metrics that are commonly
used in machine learning. The following subsections provide an overview of the metrics
used for measuring model performance [14], described in the context of our work.
6.2.1 Confusion Matrix
The confusion matrix shows the performance of an estimator on the evaluation/test set, as
shown in Table 6.2.
• True Positives (TP) : In this study, TP represents the correctly classified malware
samples.
• True Negative (TN) : TN represents the correctly classified benign samples.
• False Positive (FP) : FP represents those benign samples that have been classified
as malware.
• False Negative (FN) : FN represents those malware samples that have been classified
as benign.
                    Predicted Class
Actual Class        Class = No    Class = Yes
Class = No          TN            FP
Class = Yes         FN            TP
Table 6.2: Confusion Matrix
6.2.2 Accuracy
Accuracy is the ratio of correctly classified samples (benign and malware) to the total
number of samples. It is calculated using eq. 6.1.

Accuracy = (TP + TN) / (TP + FP + TN + FN)        (6.1)
6.2.3 Precision
Precision is the ratio of correctly classified malware to the total number of samples
classified as malware. It is calculated using eq. 6.2.

Precision = TP / (TP + FP)        (6.2)
6.2.4 Recall
Recall is the ratio of correctly classified malware to the total number of malware
samples. It is calculated using eq. 6.3.

Recall = TP / (TP + FN)        (6.3)
6.2.5 F1 Score
The F1 score is the harmonic mean of precision and recall, indicating how precise and
robust the model is. It is calculated using eq. 6.4.

F1 Score = 2 / (1/Precision + 1/Recall)        (6.4)
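The four metrics above can be computed directly from the confusion-matrix counts. The counts in the example are made-up numbers chosen only to exercise the formulas.

```python
# The four evaluation metrics, computed from confusion-matrix counts
# (eqs. 6.1-6.4).
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 / (1 / precision + 1 / recall)   # harmonic mean of prec. and rec.
    return accuracy, precision, recall, f1

# Example counts: 40 malware caught, 10 missed, 45 benign correct, 5 flagged.
acc, prec, rec, f1 = metrics(tp=40, tn=45, fp=5, fn=10)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
```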
6.3 Results
6.3.1 Analysis Results
This section discusses the results of the analysis. For all the tables,
• I - the total number of features in the initial set for a category.
• F - the total number of features in the final set for a category.
The analysis has been performed on the following categories -
• Requested Permissions, Used Permissions
• Restricted API, API, API Packages
• Service, Receiver, Activity, Providers
• Intent Filters, Intent Objects, Intent Const
• System Commands
• Opcodes
• Misc Features
6.3.1.1 Requested Permissions
The features in the requested permissions category show good detection results for both
the initial and final sets of features on all datasets. The feature set sizes for the
three datasets have been reduced by factors of 261, 658 and 247 respectively while still
maintaining similar detection results. Therefore we consider this category for training
our final model. Table 6.3 shows the evaluation results and feature size (total number of
features) for the initial set of features.
Dataset
Classifiers
D/2010-2012 D/2013-2015 D/2016-2019
I: 15,162 I: 39,501 I: 14,748
Acc Prec Rec F1 Acc Prec Rec F1 Acc Prec Rec F1
LR 88.96 88.04 90.31 89.16 89.64 88.83 90.79 89.8 82.06 87.07 75.23 80.72
RF 93.78 92.21 95.71 93.93 94.39 94.47 94.35 94.41 88.54 94.41 81.9 87.71
NN 90.51 89.85 91.46 90.65 90.87 91.77 89.9 90.82 82.76 90.89 72.75 80.81
ET 93.79 92.35 95.57 93.94 94.42 94.7 94.17 94.43 88.7 94.84 81.81 87.84
Table 6.3: Performance results on the evaluation set for requested permissions using initial features
Table 6.4 shows the evaluation results and feature size (total number of features) for the
final set of features.
Dataset
Classifiers
D/2010-2012 D/2013-2015 D/2016-2019
F: 58 F: 60 F: 59
Acc Prec Rec F1 Acc Prec Rec F1 Acc Prec Rec F1
LR 88.32 87.34 89.78 88.55 88.98 87.99 90.4 89.18 81.17 85.53 74.96 79.9
RF 93.59 91.96 95.6 93.75 94.17 94.2 94.2 94.2 88.33 94.18 81.67 87.48
NN 89.02 87.74 90.86 89.27 90.46 89.73 91.48 90.6 85.17 91.05 77.96 83.99
ET 93.56 92.16 95.29 93.7 94.2 94.48 93.95 94.21 88.33 94.36 81.49 87.46
Table 6.4: Performance results on the evaluation set for requested permissions using final features
6.3.1.2 Used Permissions
The used permissions category has a small set of features in both the initial and final
feature sets. The feature set size has been reduced to less than half while maintaining
similar detection results. Although the used permissions category shows decent results,
the performance with both the initial and final sets of features is not good enough. The
category shows good accuracy and precision for all the datasets, but the results for the
other evaluation metrics are relatively poor, and there is a large inconsistency between
the different evaluation metrics. Table 6.5 shows the performance results for the used
permissions category using the initial set of features.
Dataset
Classifiers
D/2010-2012 D/2013-2015 D/2016-2019
I: 60 I: 60 I: 55
Acc Prec Rec F1 Acc Prec Rec F1 Acc Prec Rec F1
LR 85.81 83.6 89.3 86.36 78.91 81.71 74.76 78.08 81.26 85.08 75.74 80.14
RF 87.71 86.78 89.14 87.95 82.48 87.11 76.44 81.43 84.19 90.78 76.04 82.76
NN 86.24 84.63 88.77 86.65 79.76 84.64 72.96 78.37 81.37 85.2 75.85 80.25
ET 87.74 86.94 89.0 87.96 82.52 87.27 76.36 81.45 84.13 90.81 75.89 82.68
Table 6.5: Performance results on the evaluation set for used permissions using initial features
Table 6.6 shows the performance results for the used permissions category using the final set of features.
Dataset
Classifiers
D/2010-2012 D/2013-2015 D/2016-2019
F: 19 F: 32 F: 24
Acc Prec Rec F1 Acc Prec Rec F1 Acc Prec Rec F1
LR 85.63 83.36 89.22 86.19 78.85 82.22 73.9 77.84 81.22 85.02 75.71 80.1
RF 87.46 86.95 88.33 87.63 82.57 87.23 76.52 81.52 84.13 90.79 75.89 82.68
NN 85.95 82.14 92.06 86.82 80.12 84.45 74.08 78.92 82.87 87.86 76.21 81.62
ET 87.52 86.63 88.9 87.75 82.46 87.54 75.91 81.3 84.1 90.8 75.83 82.64
Table 6.6: Performance results on evaluation set for used permissions using final features
6.3.1.3 Restricted API
The initial feature set has been reduced by a factor of at least 5. The restricted API
category shows relatively better detection results for recent samples than for older
samples. The recall values for D/2013-2015 and D/2016-2019 are also the best among all
the metrics. However, the overall detection results are not good enough. We do not
consider this feature set because of its inconsistent results across the datasets and the
evaluation metrics. Table 6.7 shows the performance results for the restricted API
category using the initial set of features.
Dataset
Classifiers
D/2010-2012 D/2013-2015 D/2016-2019
I: 1874 I: 1740 I: 398
Acc Prec Rec F1 Acc Prec Rec F1 Acc Prec Rec F1
LR 74.34 77.37 69.23 73.08 77.33 71.98 89.89 79.94 83.6 81.0 87.71 84.22
RF 79.82 83.17 75.07 78.91 82.79 76.64 94.59 84.67 89.63 87.49 92.44 89.89
NN 76.51 79.91 71.2 75.3 80.23 74.68 91.77 82.35 86.76 85.35 88.71 87.0
ET 79.8 83.18 75.0 78.88 82.76 76.7 94.37 84.62 89.73 87.66 92.44 89.99
Table 6.7: Performance results on evaluation set for restricted API using initial features
Table 6.8 shows the performance results for the restricted API category using the final
set of features.
Dataset
Classifiers
D/2010-2012 D/2013-2015 D/2016-2019
F: 66 F: 60 F: 70
Acc Prec Rec F1 Acc Prec Rec F1 Acc Prec Rec F1
LR 73.05 75.99 67.86 71.7 77.33 71.98 89.89 79.94 82.39 79.8 86.66 83.09
RF 79.45 82.93 74.47 78.47 82.79 76.64 94.59 84.67 89.43 87.11 92.51 89.73
NN 75.75 78.12 71.93 74.9 80.23 74.68 91.77 82.35 86.1 82.34 91.85 86.83
ET 79.45 82.97 74.41 78.46 82.76 76.7 94.37 84.62 89.44 87.18 92.44 89.73
Table 6.8: Performance results on evaluation set for restricted API using final features
6.3.1.4 Service
The initial feature size of the service category is very large compared to the categories
above. The feature set sizes have been reduced by factors of 89, 138 and 44 respectively,
bringing them below 1000 features. This category shows good precision for samples after
2012 but performs poorly on earlier samples. The results across the different evaluation
metrics are also inconsistent. Therefore we do not consider this set. Table 6.9 shows the
performance results for the service category using the initial set of features.
Dataset
Classifiers
D/2010-2012 D/2013-2015 D/2016-2019
I: 88,575 I: 1,25,578 I: 41,994
Acc Prec Rec F1 Acc Prec Rec F1 Acc Prec Rec F1
LR 78.1 98.04 57.6 72.57 85.05 97.87 71.82 82.84 82.32 96.83 66.76 79.03
RF 78.63 97.92 58.75 73.44 86.0 98.27 73.44 84.06 83.1 96.66 68.51 80.19
NN 78.7 97.82 58.96 73.58 86.3 97.95 74.29 84.49 83.17 96.28 68.94 80.35
ET 78.65 97.96 58.76 73.46 86.11 98.36 73.59 84.19 83.19 96.73 68.63 80.29
Table 6.9: Performance results on evaluation set for service using initial features
Table 6.10 shows the performance results for the service category using the final set of features.
Dataset
Classifiers
D/2010-2012 D/2013-2015 D/2016-2019
F: 993 F: 907 F: 938
Acc Prec Rec F1 Acc Prec Rec F1 Acc Prec Rec F1
LR 76.35 98.22 53.95 69.65 79.86 97.77 61.32 75.37 79.41 96.27 61.11 74.76
RF 76.37 98.2 54.0 69.68 79.95 97.68 61.56 75.52 79.69 96.5 61.53 75.15
NN 76.36 98.2 53.99 69.67 79.89 97.46 61.59 75.48 79.54 96.27 61.38 74.96
ET 76.37 98.2 54.0 69.68 79.94 97.63 61.59 75.53 79.65 96.56 61.41 75.08
Table 6.10: Performance results on evaluation set for service using final features
6.3.1.5 Receiver
The initial feature size of the receiver category is also very large. The feature set
sizes have been reduced by factors of 95, 123 and 36, bringing them below 1000 features.
Similar to the service category, this category shows good precision. It shows decent
detection results for samples between 2013 and 2015 but performs poorly for samples before
2013 and after 2015. Therefore we do not consider this set for the final model. Table
6.11 shows the performance results for the receiver category using the initial set of
features.
Classifier   D/2010-2012 (I: 90,081)   D/2013-2015 (I: 1,14,134)   D/2016-2019 (I: 33,115)
             Acc  Prec  Rec  F1        Acc  Prec  Rec  F1          Acc  Prec  Rec  F1
LR 72.3 98.19 45.77 62.43 86.05 98.46 73.38 84.09 79.75 96.93 61.38 75.16
RF 72.31 97.83 45.96 62.54 86.97 98.79 74.98 85.25 79.96 97.1 61.68 75.44
NN 72.48 98.1 46.16 62.78 87.22 98.36 75.84 85.64 79.83 96.93 61.53 75.28
ET 72.3 97.87 45.91 62.5 87.15 98.8 75.34 85.49 79.9 97.21 61.49 75.33
Table 6.11: Performance results on evaluation set for receiver using initial features
Table 6.12 shows the performance results for receiver category using final set of features.
Classifier   D/2010-2012 (F: 947)   D/2013-2015 (F: 927)   D/2016-2019 (F: 912)
             Acc  Prec  Rec  F1     Acc  Prec  Rec  F1     Acc  Prec  Rec  F1
LR 70.31 98.46 41.61 58.5 80.9 98.17 63.18 76.88 75.91 96.48 53.7 68.99
RF 70.35 98.54 41.65 58.56 80.96 98.21 63.26 76.96 76.04 96.52 53.94 69.2
NN 70.35 98.54 41.65 58.55 80.91 98.15 63.21 76.9 75.96 96.56 53.74 69.05
ET 70.35 98.56 41.65 58.55 80.97 98.33 63.21 76.95 76.0 96.54 53.85 69.13
Table 6.12: Performance results on evaluation set for receiver using final features
6.3.1.6 Activity
The initial feature set of the activity category is very large for all the datasets. It has
been reduced by factors of 1313, 1406 and 457, bringing each set below 1,000 features.
This category shows good detection results with the initial set of features. However, even
with a final set of nearly 1,000 features, the detection results drop significantly; maintaining
comparable results between the initial and final sets would require a much larger final set.
Therefore we do not consider this feature set, as the required final set would be too large.
Table 6.13 shows the detection results for the activity category using the initial set
of features.
Classifier   D/2010-2012 (I: 12,53,259)   D/2013-2015 (I: 13,61,756)   D/2016-2019 (I: 4,15,243)
             Acc  Prec  Rec  F1           Acc  Prec  Rec  F1           Acc  Prec  Rec  F1
LR 91.37 97.29 85.21 90.85 92.95 98.19 87.51 92.51 86.85 96.21 76.68 85.34
RF 91.52 96.86 85.92 91.06 93.07 97.94 88.05 92.73 86.25 96.29 75.35 84.54
NN 90.69 97.44 83.68 90.04 93.81 98.40 89.12 93.53 87.54 96.2 78.12 86.22
ET 91.61 96.87 86.11 91.17 93.44 98.03 88.73 93.15 86.91 95.82 77.15 85.47
Table 6.13: Performance results on evaluation set for activity using initial features
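The thesis's actual selection pipeline uses the feature selection techniques introduced in earlier chapters; as a purely illustrative sketch of how a sparse binary feature set of this size can be cut down, one simple approach is to drop features that occur in almost no samples. The function name and threshold below are assumptions for illustration only:

```python
def reduce_binary_features(rows, min_count=2):
    """Keep only the columns (features) that are set in at least
    `min_count` samples; returns kept column indices and reduced rows.

    `rows` is a list of equal-length 0/1 lists, one per application.
    """
    n_cols = len(rows[0])
    counts = [sum(row[j] for row in rows) for j in range(n_cols)]
    kept = [j for j in range(n_cols) if counts[j] >= min_count]
    reduced = [[row[j] for j in kept] for row in rows]
    return kept, reduced

# Illustrative: column 1 appears in only one sample and column 3 in none,
# so both are dropped.
rows = [[1, 0, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 1, 0]]
kept, reduced = reduce_binary_features(rows, min_count=2)
print(kept)  # -> [0, 2]
```

Such frequency filtering alone does not produce the thousand-fold reductions reported here; those come from the supervised selection steps described earlier in the thesis.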
Table 6.14 shows the detection results for activity using final set of features.
Classifier   D/2010-2012 (F: 954)   D/2013-2015 (F: 968)   D/2016-2019 (F: 908)
             Acc  Prec  Rec  F1     Acc  Prec  Rec  F1     Acc  Prec  Rec  F1
LR 88.67 97.08 79.87 87.64 85.28 96.15 73.66 83.41 80.14 95.37 63.27 76.07
RF 88.9 96.85 80.55 87.95 85.67 96.11 74.5 83.94 80.88 96.08 64.3 77.04
NN 88.82 96.99 80.26 87.84 85.41 96.12 73.96 83.6 80.2 95.48 63.33 76.15
ET 88.89 96.9 80.49 87.94 85.66 96.17 74.44 83.92 80.87 96.12 64.26 77.02
Table 6.14: Performance results on evaluation set for activity using final features
6.3.1.7 Providers
The feature set has been reduced to roughly a tenth of its original size. This category
does not show good detection results at all, so we do not consider it for our model.
Table 6.15 shows the performance results for the providers category using the initial set
of features.
Classifier   D/2010-2012 (I: 9952)   D/2013-2015 (I: 10,265)   D/2016-2019 (I: 3833)
             Acc  Prec  Rec  F1      Acc  Prec  Rec  F1        Acc  Prec  Rec  F1
LR 51.01 87.09 3.03 5.86 53.95 97.24 8.6 15.8 60.42 55.97 97.02 70.99
RF 51.61 50.96 99.82 67.48 53.94 97.03 8.61 15.82 60.33 55.91 97.02 70.94
NN 51.62 50.97 99.82 67.48 53.95 97.21 8.61 15.83 60.43 55.98 97.01 70.99
ET 51.61 50.97 99.81 67.48 53.95 97.26 8.6 15.8 60.33 55.91 97.02 70.94
Table 6.15: Performance results on evaluation set for providers using initial features
Table 6.16 shows the performance results for providers category using final set of features.
Classifier   D/2010-2012 (F: 444)   D/2013-2015 (F: 418)   D/2016-2019 (F: 443)
             Acc  Prec  Rec  F1     Acc  Prec  Rec  F1     Acc  Prec  Rec  F1
LR 50.82 86.58 2.6 5.05 50.91 85.4 2.66 5.16 60.12 55.78 96.95 70.81
RF 50.82 86.54 2.61 5.06 50.93 84.77 2.73 5.28 60.07 55.75 96.95 70.79
NN 50.82 86.58 2.6 5.05 50.92 85.24 2.69 5.21 60.13 55.79 96.95 70.82
ET 50.82 86.58 2.6 5.05 50.92 85.02 2.7 5.23 60.09 55.76 96.92 70.79
Table 6.16: Performance results on evaluation set for providers using final features
6.3.1.8 Intent Filters
The initial feature set of the intent-filter category is also very large. It has been
reduced by factors of 87, 88 and 37, bringing each set below 1,000 features. This category
shows good detection results for samples after 2012 but performs poorly on earlier samples,
so we do not consider this set for the final model. Table 6.17 shows the performance
results for this category using the initial features.
Classifier   D/2010-2012 (I: 82,281)   D/2013-2015 (I: 85,948)   D/2016-2019 (I: 34,200)
             Acc  Prec  Rec  F1        Acc  Prec  Rec  F1        Acc  Prec  Rec  F1
LR 77.94 90.29 62.9 74.14 87.91 92.23 82.93 87.33 79.41 92.17 64.2 75.68
RF 78.89 92.1 63.46 75.14 89.11 93.66 84.02 88.58 81.33 91.73 68.79 78.62
NN 78.41 90.84 63.46 74.72 88.6 93.28 83.31 88.01 79.88 90.62 66.58 76.76
ET 78.9 92.19 63.42 75.15 89.0 93.49 83.95 88.46 81.43 92.12 68.66 78.68
Table 6.17: Performance results on evaluation set for Intent Filters using initial features
Table 6.18 shows the performance results for the intent-filter category using the final set
of features.
Classifier   D/2010-2012 (F: 938)   D/2013-2015 (F: 973)   D/2016-2019 (F: 916)
             Acc  Prec  Rec  F1     Acc  Prec  Rec  F1     Acc  Prec  Rec  F1
LR 76.95 89.34 61.5 72.85 86.24 93.81 77.76 85.03 78.67 91.07 63.49 74.82
RF 78.27 91.54 62.57 74.33 88.85 93.57 83.55 88.28 80.16 91.86 66.1 76.88
NN 77.64 90.21 62.31 73.71 88.12 93.31 82.25 87.44 79.48 92.78 63.85 75.64
ET 78.27 91.62 62.5 74.31 88.82 93.63 83.42 88.23 80.11 92.01 65.87 76.78
Table 6.18: Performance results on evaluation set for Intent Filters using final features
6.3.1.9 Intent Objects
The initial feature set of the intent-object category is also very large. It has been
reduced by factors of 123, 131 and 33, bringing each set below 1,000 features. This category
does not show good detection results on any of the datasets. Therefore, due to the poor
results and the large size of the final feature set, we do not consider this set for the final
model. Table 6.19 shows the detection results on the initial set of features.
Classifier   D/2010-2012 (I: 1,19,570)   D/2013-2015 (I: 1,30,259)   D/2016-2019 (I: 33,461)
             Acc  Prec  Rec  F1          Acc  Prec  Rec  F1          Acc  Prec  Rec  F1
LR 65.19 98.2 31.36 47.54 66.07 66.57 65.24 65.9 76.77 76.67 76.83 76.75
RF 65.18 98.14 31.36 47.54 66.63 66.69 67.12 66.91 78.67 79.37 77.37 78.36
NN 65.19 97.92 31.45 47.61 66.59 66.78 66.72 66.75 78.01 78.65 76.77 77.7
ET 65.17 98.15 31.33 47.5 66.63 66.68 67.13 66.9 78.75 79.48 77.42 78.43
Table 6.19: Performance results on evaluation set for Intent Objects using initial features
Table 6.20 shows the detection results on the final set of features.
Classifier   D/2010-2012 (F: 972)   D/2013-2015 (F: 994)   D/2016-2019 (F: 986)
             Acc  Prec  Rec  F1     Acc  Prec  Rec  F1     Acc  Prec  Rec  F1
LR 63.41 98.17 27.76 43.28 63.44 64.51 60.57 62.48 72.51 73.12 71.04 72.06
RF 63.42 98.26 27.76 43.29 64.18 64.82 62.83 63.81 75.52 69.8 89.79 78.54
NN 63.43 98.3 27.76 43.29 63.89 65.24 60.24 62.64 74.73 69.31 88.62 77.78
ET 63.42 98.18 27.77 43.3 64.19 64.84 62.8 63.8 75.51 69.8 89.76 78.53
Table 6.20: Performance results on evaluation set for Intent Objects using final features
6.3.1.10 Intent Const
The intent-const category shows very good detection results for all the datasets, and
the results are consistent across the evaluation metrics. The final sets of features have
been reduced by factors of 240, 191 and 72, bringing each below 100 features. Therefore, we
consider this set for training our final detection model. Table 6.21 shows the detection
results for the intent-const category using the initial set of features.
Classifier   D/2010-2012 (I: 14,409)   D/2013-2015 (I: 17,226)   D/2016-2019 (I: 4816)
             Acc  Prec  Rec  F1        Acc  Prec  Rec  F1        Acc  Prec  Rec  F1
LR 94.19 94.77 93.61 94.19 86.24 86.18 86.47 86.33 87.44 87.6 87.18 87.39
RF 96.39 96.16 96.68 96.42 91.54 90.25 93.23 91.72 93.01 93.04 92.95 93.0
NN 95.33 94.88 95.9 95.39 88.36 88.17 88.75 88.46 90.37 89.83 91.0 90.41
ET 96.36 96.23 96.55 96.39 91.56 90.24 93.31 91.75 93.21 93.29 93.08 93.19
Table 6.21: Performance results on evaluation set for Intent Const using initial features
Table 6.22 shows the detection results for intent const category using final set of features.
Classifier   D/2010-2012 (F: 60)   D/2013-2015 (F: 90)   D/2016-2019 (F: 66)
             Acc  Prec  Rec  F1    Acc  Prec  Rec  F1    Acc  Prec  Rec  F1
LR 93.65 94.35 92.94 93.64 85.31 85.0 85.94 85.47 85.17 85.81 84.2 85.0
RF 96.29 96.05 96.6 96.32 91.33 89.78 93.38 91.55 92.53 92.42 92.62 92.52
NN 94.71 93.8 95.82 94.8 87.16 86.79 87.81 87.3 87.86 87.98 87.65 87.81
ET 96.27 96.12 96.48 96.3 91.39 90.0 93.22 91.58 92.63 92.58 92.66 92.62
Table 6.22: Performance results on evaluation set for Intent Const using final features
6.3.1.11 API
The API category also shows very good detection results and consistent evaluation
metrics for all the datasets. The final sets of features have been reduced by factors of 2152,
2612 and 2270, bringing each down to at most 30 features. Therefore, we consider this set
for training our final detection model. Table 6.23 shows the detection results for the API
category using the initial set of features.
Classifier   D/2010-2012 (I: 38,747)   D/2013-2015 (I: 52,245)   D/2016-2019 (I: 68,113)
             Acc  Prec  Rec  F1        Acc  Prec  Rec  F1        Acc  Prec  Rec  F1
LR 97.82 97.35 98.34 97.84 97.47 97.05 97.95 97.5 97.5 97.2 97.81 97.51
RF 97.93 97.93 97.96 97.94 97.62 97.72 97.54 97.63 97.03 97.02 97.04 97.03
NN 97.74 97.64 97.87 97.76 97.8 97.24 98.41 97.82 97.32 97.02 97.63 97.33
ET 97.92 97.94 97.93 97.93 97.53 97.8 97.26 97.53 97.03 97.15 96.9 97.02
Table 6.23: Performance results on evaluation set for API using initial features
Table 6.24 shows the detection results for API category using final set of features.
Classifier   D/2010-2012 (F: 18)   D/2013-2015 (F: 20)   D/2016-2019 (F: 30)
             Acc  Prec  Rec  F1    Acc  Prec  Rec  F1    Acc  Prec  Rec  F1
LR 89.61 92.76 86.07 89.29 83.18 88.45 76.52 82.05 90.33 89.72 91.06 90.38
RF 97.72 97.56 97.9 97.73 97.12 97.31 96.96 97.13 96.83 96.88 96.77 96.82
NN 89.34 93.39 84.8 88.89 88.19 86.3 90.93 88.56 91.22 91.36 91.02 91.19
ET 97.66 97.57 97.78 97.68 97.21 97.42 97.02 97.22 96.95 97.04 96.84 96.94
Table 6.24: Performance results on evaluation set for API using final features
6.3.1.12 API packages
The API-package category shows very good detection results with very few features in
the final set. The performance results are also consistent across the datasets and all
evaluation metrics. Therefore we consider this category for our final model. Table 6.25
shows the performance results for this category using the initial set of features.
Classifier   D/2010-2012 (I: 179)   D/2013-2015 (I: 212)   D/2016-2019 (I: 232)
             Acc  Prec  Rec  F1     Acc  Prec  Rec  F1     Acc  Prec  Rec  F1
LR 91.01 91.78 90.21 90.99 87.67 88.83 86.33 87.56 85.71 87.42 83.36 85.34
RF 97.48 97.63 97.35 97.49 97.67 97.93 97.42 97.67 96.25 96.62 95.84 96.23
NN 90.25 87.05 94.7 90.34 90.54 90.2 90.37 90.28 49.79 49.84 98.14 66.11
ET 97.31 97.7 96.93 97.32 97.55 98.0 97.11 97.55 96.2 96.73 95.63 96.17
Table 6.25: Performance results on evaluation set for API packages using initial features
Table 6.26 shows the performance results for this category using final set of features.
Classifier   D/2010-2012 (F: 15)   D/2013-2015 (F: 20)   D/2016-2019 (F: 25)
             Acc  Prec  Rec  F1    Acc  Prec  Rec  F1    Acc  Prec  Rec  F1
LR 88.25 91.11 84.91 87.9 80.18 83.92 74.91 79.16 82.77 80.79 85.89 83.26
RF 97.31 97.31 97.35 97.33 97.27 97.57 96.98 97.27 96.26 96.52 95.96 96.24
NN 90.35 89.13 92.03 90.56 77.49 89.54 62.51 73.62 61.69 98.99 23.48 37.96
ET 97.22 97.39 97.07 97.23 97.35 97.64 97.08 97.36 96.23 96.82 95.6 96.2
Table 6.26: Performance results on evaluation set for API packages using final features
6.3.1.13 System Commands
The system-commands category also shows very good detection results in all the datasets
and for all the evaluation metrics, with very few features. Therefore we consider this
set for our final model. Table 6.27 shows the performance results for this category using
the initial set of features.
Classifier   D/2010-2012 (I: 181)   D/2013-2015 (I: 193)   D/2016-2019 (I: 175)
             Acc  Prec  Rec  F1     Acc  Prec  Rec  F1     Acc  Prec  Rec  F1
LR 82.04 85.46 77.48 81.27 80.26 86.74 71.68 78.49 81.72 82.73 80.1 81.39
RF 92.48 92.82 92.18 92.5 91.83 95.14 88.24 91.56 92.03 91.68 92.42 92.05
NN 86.5 89.19 83.24 86.11 83.09 85.65 79.69 82.57 85.35 85.45 85.15 85.3
ET 92.47 93.01 91.93 92.47 91.73 95.09 88.1 91.46 92.03 91.84 92.21 92.03
Table 6.27: Performance results on evaluation set for system commands using initial features
Table 6.28 shows the performance results for this category using final set of features.
Classifier   D/2010-2012 (F: 42)   D/2013-2015 (F: 42)   D/2016-2019 (F: 42)
             Acc  Prec  Rec  F1    Acc  Prec  Rec  F1    Acc  Prec  Rec  F1
LR 81.36 84.84 76.62 80.52 78.75 84.74 70.38 76.9 80.31 81.12 78.9 80.0
RF 92.25 92.85 91.65 92.25 91.51 94.86 87.87 91.23 91.93 91.64 92.24 91.94
NN 86.73 88.04 85.19 86.59 82.94 87.43 77.15 81.97 84.3 85.91 81.99 83.9
ET 92.19 92.99 91.36 92.17 91.45 94.88 87.71 91.15 91.82 91.56 92.11 91.83
Table 6.28: Performance results on evaluation set for system commands using final features
6.3.1.14 Opcodes
The final feature set sizes for the opcode category have been reduced below 50. This
category shows very good detection results with very few features, and the results are
consistent. Therefore we consider this set for our final model. Table 6.29 shows the
performance results on the evaluation set for opcodes using the initial features.
Classifier   D/2010-2012 (I: 222)   D/2013-2015 (I: 222)   D/2016-2019 (I: 224)
             Acc  Prec  Rec  F1     Acc  Prec  Rec  F1     Acc  Prec  Rec  F1
LR 88.26 85.98 91.59 88.7 85.66 84.97 86.82 85.88 85.91 89.05 81.82 85.28
RF 96.95 97.46 96.45 96.95 96.74 97.15 96.35 96.75 95.95 96.44 95.4 95.92
NN 91.84 90.0 94.25 92.07 87.14 82.91 93.72 87.99 50.01 49.96 99.94 66.62
ET 96.79 97.46 96.12 96.79 96.58 97.12 96.05 96.58 95.8 96.59 94.94 95.76
Table 6.29: Performance results on evaluation set for opcodes using initial features
Table 6.30 shows performance results on evaluation set for opcodes using final features.
Classifier   D/2010-2012 (F: 30)   D/2013-2015 (F: 42)   D/2016-2019 (F: 30)
             Acc  Prec  Rec  F1    Acc  Prec  Rec  F1    Acc  Prec  Rec  F1
LR 86.07 85.95 86.42 86.19 81.18 79.5 84.27 81.82 81.19 85.63 74.87 79.89
RF 96.42 97.09 95.76 96.42 96.26 96.56 95.99 96.27 95.16 95.72 94.53 95.13
NN 85.44 83.21 89.01 86.01 83.28 82.22 85.14 83.66 61.72 98.5 23.66 38.16
ET 96.4 97.18 95.6 96.39 96.29 96.68 95.9 96.29 95.28 96.07 94.4 95.23
Table 6.30: Performance results on evaluation set for opcodes using final features
6.3.1.15 Misc Features
The misc-features category also shows very good detection results with very few
features. Therefore we consider this set for our final model. Table 6.31 shows the detection
results for this category using the initial set of features.
Classifier   D/2010-2012 (I: 42)   D/2013-2015 (I: 42)   D/2016-2019 (I: 42)
             Acc  Prec  Rec  F1    Acc  Prec  Rec  F1    Acc  Prec  Rec  F1
LR 88.79 90.42 86.91 88.63 88.32 89.79 86.61 88.17 86.56 87.02 85.86 86.44
RF 97.1 96.94 97.31 97.12 97.45 97.28 97.66 97.47 96.62 97.01 96.2 96.6
NN 90.89 90.34 91.7 91.01 90.63 88.84 93.04 90.89 83.39 87.42 77.93 82.4
ET 97.07 97.07 97.12 97.09 97.41 97.41 97.44 97.43 96.55 97.13 95.93 96.53
Table 6.31: Performance results on evaluation set for Misc Features using initial features
Table 6.32 shows the detection results for this category using final set of features.
Classifier   D/2010-2012 (F: 24)   D/2013-2015 (F: 26)   D/2016-2019 (F: 30)
             Acc  Prec  Rec  F1    Acc  Prec  Rec  F1    Acc  Prec  Rec  F1
LR 88.02 89.19 86.7 87.92 87.94 89.39 86.25 87.79 86.16 86.55 85.56 86.05
RF 97.04 96.84 97.28 97.06 97.46 97.24 97.73 97.48 96.5 96.87 96.08 96.47
NN 90.34 89.69 91.3 90.48 88.23 89.85 86.34 88.06 50.2 50.05 99.87 66.68
ET 96.95 96.98 96.95 96.97 97.33 97.23 97.46 97.34 96.67 97.26 96.03 96.64
Table 6.32: Performance results on evaluation set for Misc Features using final features
6.3.2 Final Results
Based on the category-wise analysis of the results, we identify 7 categories as relevant.
From here on, we use the D/2010-2019 dataset and the Extra Trees classifier for further
analysis and for generating the final model.
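The Extra Trees classifier referred to here is scikit-learn's ExtraTreesClassifier. A minimal sketch of training it and ranking feature importances, which is how "important features" can be identified from a fitted ensemble (the toy data and hyperparameters below are illustrative assumptions, not the thesis's configuration):

```python
from sklearn.ensemble import ExtraTreesClassifier

# Toy data: only the first column actually determines the label,
# so it should dominate the learned importances.
X = [[1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 0]] * 25
y = [1, 1, 0, 0] * 25

clf = ExtraTreesClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

# Rank features by the impurity-based importance scores.
ranked = sorted(enumerate(clf.feature_importances_),
                key=lambda t: t[1], reverse=True)
top_feature = ranked[0][0]
print(top_feature, clf.score(X, y))
```

Keeping only the top-ranked columns is one way to arrive at a small final feature set like those reported in the tables above.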
6.3.2.1 Base Model Results
For the D/2010-2019 dataset, the initial set contains 40,00,724 features when all the
categories are considered. The detection results on the evaluation set using the Extra
Trees classifier are:
• Acc - 97.63
• Prec - 98.04
• Rec - 97.21
• F1 - 97.62
When considering only the 7 relevant categories, the initial set contains 1,76,656 features.
The detection results on the evaluation set are:
• Acc - 98.0
• Prec - 98.21
• Rec - 97.79
• F1 - 98.0
6.3.2.2 Individual Category Results
For the D/2010-2012 dataset, we identify the important features in each of the 7 categories.
Table 6.33 shows the performance results for these categories using the initial (I) and
final (F) sets of features, where I and F give the number of features in each set.
Category            I      Acc   Prec  Rec   F1       F    Acc   Prec  Rec   F1
API Packages 238 97.32 97.6 97.03 97.31 24 97.06 97.3 96.82 97.06
Opcodes 225 96.61 97.25 95.94 96.59 45 96.34 96.87 95.78 96.32
System Commands 209 91.25 94.78 87.32 90.9 45 91.05 94.54 87.16 90.7
API 78199 97.89 98.02 97.75 97.89 35 97.44 97.66 97.21 97.43
Requested Perm 66219 93.15 93.02 93.33 93.17 60 92.89 92.98 92.8 92.89
Intent Const 31567 90.23 88.7 92.23 90.43 60 89.61 94.5 84.14 89.02
Misc Feat 42 97.1 96.99 97.23 97.11 30 97.1 96.98 97.23 97.11
Table 6.33: Performance results of relevant categories on evaluation set
6.3.2.3 Combined Category Results
After identifying, in each relevant category, the important features that give robust,
generalised and efficient detection, we explore combining the detection abilities of the
categories. Table 6.34 shows the performance results for several combinations of the 7
identified categories (RP: requested permissions, IC: intent const, AP: API packages,
SC: system commands, OP: opcodes, M: misc features).
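The combined feature-set sizes in Table 6.34 are roughly the sums of the per-category final sizes in Table 6.33, which is consistent with combining categories by concatenating their per-app feature vectors. A hedged sketch of that operation (the function name and toy data are assumptions for illustration):

```python
def combine_categories(*category_matrices):
    """Horizontally concatenate per-category feature rows.

    Each argument is a list of per-app feature vectors (same app order
    in every category); the result is one wider vector per app.
    """
    combined = []
    for app_rows in zip(*category_matrices):
        row = []
        for part in app_rows:
            row.extend(part)
        combined.append(row)
    return combined

# e.g. 2 apps, with 3 requested-permission features and 2 intent-const features
rp = [[1, 0, 1], [0, 0, 1]]
ic = [[0, 1], [1, 1]]
combined = combine_categories(rp, ic)
print(combined)  # -> [[1, 0, 1, 0, 1], [0, 0, 1, 1, 1]]
```

A single classifier can then be trained on the widened matrix, which is what the combination rows of Table 6.34 evaluate.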
6.3.2.4 Results on Test set
Finally, we built an Android malware detection model using samples from D/2010-2019,
based on the identified important features. Its performance on different test sets is
shown in Table 6.35. The results are consistent across the evaluation metrics and all
the test sets, suggesting that the model is robust, sustainable and capable of effective
detection of Android malware.
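The metric columns of Table 6.35 follow directly from the TN/TP/FN/FP counts in the same row. As a check, the standard definitions applied to the D/2010-2012 row (small discrepancies with the printed table are presumably due to rounding in the thesis):

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 (in percent) from a
    binary confusion matrix, with malware as the positive class."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return tuple(round(100 * v, 2) for v in (acc, prec, rec, f1))

# Counts for the D/2010-2012 test set, taken from Table 6.35.
acc, prec, rec, f1 = metrics_from_counts(tp=38105, tn=37841, fp=883, fn=617)
print(acc, prec, rec, f1)  # -> 98.06 97.74 98.41 98.07
```

The table reports 98.06 / 97.77 / 98.40 / 98.06 for this row, in close agreement.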
6.3.2.5 Effectiveness of Identified Features
The identified features are those that are the most informative and important for malware
detection across sample sets. To demonstrate their effectiveness for Android malware
detection, we train two models on the training sets of the D/DREBIN and D/AMD datasets
and evaluate them on their respective test sets. Table 6.36 shows the performance results
of these two models.
Summary: We have analysed several categories of features. Our analysis shows that some
categories require many features to give good detection results while others require very
few. Some categories detect well only for samples within a certain time frame, and some
perform well only on certain evaluation metrics. Therefore, to build a light-weight model
that will sustain over time, we have analysed and identified only the relevant features
and built a detection model based on them. The performance of this model on various test
sets suggests that it is robust and will sustain over time. Further, to show the
effectiveness of the identified features, we have experimented on two more datasets; the
performance of the corresponding models shows that this set of features yields an
effective malware detection model.
Categories F Acc Prec Rec F1
RP + IC + API 153 97.98 97.87 98.1 97.99
RP + IC + AP 142 97.97 97.94 98.01 97.98
IC + API + M 123 97.91 97.97 97.86 97.92
RP + IC + API + AP 177 98.05 97.96 98.16 98.06
RP + IC + API + SC 198 98.04 97.92 98.17 98.04
RP + IC + API + M 183 97.99 97.88 98.11 98.0
RP + IC + AP + SC 187 97.99 97.92 98.06 97.99
RP + IC + AP + M 172 97.99 97.95 98.04 98.0
RP + IC + SC + OP 208 97.87 97.87 97.89 97.88
RP + IC + SC + M 193 97.89 97.77 98.02 97.9
RP + API + AP + SC 162 97.89 97.77 98.03 97.9
RP + API + AP + M 147 97.89 97.81 97.99 97.9
RP + API + SC + M 168 97.88 97.75 98.02 97.88
IC + API + AP + SC 162 97.86 98.01 97.7 97.85
IC + API + AP + M 147 97.99 98.11 97.87 97.99
IC + API + SC + M 168 97.91 97.95 97.88 97.91
IC + API + OP + M 168 97.88 98.01 97.75 97.88
IC + AP + SC + M 157 97.86 98.04 97.69 97.86
RP + IC + API + AP + SC 222 98.1 98.02 98.19 98.1
RP + IC + API + AP + OP 222 98.07 98.06 98.09 98.07
RP + IC + API + AP + M 207 98.12 98.06 98.2 98.13
RP + IC + API + SC + OP 243 98.02 97.93 98.12 98.03
RP + IC + API + SC + M 228 98.03 97.91 98.16 98.03
RP + IC + API + OP + M 228 98.01 97.97 98.05 98.01
RP + IC + AP + SC + OP 232 97.98 97.96 98.0 97.98
RP + IC + AP + SC + M 217 98.02 98.0 98.05 98.02
RP + IC + AP + OP + M 217 97.99 98.02 97.96 97.99
RP + IC + SC + OP + M 238 97.95 97.9 98.0 97.95
RP + API + AP + SC + OP 207 97.9 97.83 97.98 97.91
RP + API + AP + SC + M 192 97.93 97.86 98.0 97.93
IC + API + AP + SC + M 192 98.02 98.12 97.92 98.02
IC + API + AP + OP + M 192 97.92 98.1 97.74 97.92
RP + IC + API + AP + SC + OP 267 98.08 98.07 98.11 98.09
RP + IC + API + AP + SC + M 252 98.1 98.03 98.18 98.1
RP + IC + API + SC + OP + M 273 98.02 97.95 98.1 98.03
IC + API + AP + SC + OP + M 237 97.95 98.07 97.84 97.95
RP + IC + API + AP + SC + OP + M 297 98.11 98.08 98.14 98.11
Table 6.34: Performance results using combined categories
Test Set Acc Prec Rec F1 TN TP FN FP
D/2010-2012 98.06 97.77 98.40 98.06 37,841 38,105 617 883
D/2013-2015 98.72 98.81 98.62 98.72 36,585 36,526 510 437
D/2016-2019 97.65 97.52 97.78 97.65 6525 6540 148 166
Table 6.35: Performance results of final model on different test sets
Test Set Acc Prec Rec F1 TN TP FN FP
D/AMD 99.61 99.61 99.60 99.60 6,161 5,980 24 23
D/DREBIN 98.08 98.88 97.22 98.04 1,379 1,330 38 15
Table 6.36: Performance results to show effectiveness of identified features
Chapter 7
Conclusion and Future Work
7.1 Conclusion
In this thesis, we have built a light-weight malware detection model capable of fast,
generalized, accurate and efficient detection of Android malware. To fulfil this objective,
we designed a framework that deals with the challenges faced in Android malware detection.
We have worked with more than 8 lakh (800,000) samples spread over the period 2010 to 2019.
We created multiple datasets to perform parallel analyses for greater robustness. Unlike
previous datasets, the datasets we work on are highly class-balanced, so machine learning
models are expected to train better on them. We extracted several categories of information,
such as permissions, APIs, intents and app components, and analysed their effectiveness
towards Android malware detection. We used multiple evaluation metrics for a proper measure
of performance. Then, with the aim of creating a light-weight detection model, we
implemented three feature selection techniques and identified only the relevant features:
those that are the most informative and important for malware detection across samples
from different years. Finally, using these identified sets of features, we built a model
capable of temporally robust detection of Android malware.
7.2 Future Work
In this work, we have analysed only features extracted using static analysis. Additional
features could be extracted using dynamic analysis to create a more informative feature set.