Phishing Detection: A Case Analysis on Classifiers with Rules using Machine
Learning
Abstract
A typical predictive approach in data mining that produces If-Then knowledge for decision making is
rule based classification. Rule based classification includes a large number of algorithms that fall under
the categories of covering, greedy, rule induction, and associative classification. These approaches have
shown promising results due to the simplicity of the models generated and the user’s ability to
understand and maintain them. Phishing is one of the emergent online threats in the web security domain
that necessitates anti-phishing models with rules so users can easily differentiate among website types.
This article critically analyses recent research studies on the use of predictive models with rules for
phishing detection, and evaluates the applicability of these approaches on phishing. To accomplish our
task, we experimentally evaluate four different rule based classifiers that belong to greedy, associative
classification and rule induction approaches on real phishing datasets and with respect to different
evaluation measures. Moreover, we assess the classifiers derived and contrast them with known classic
classification algorithms including Bayes Net and Simple Logistic. The aim of the comparison is to
determine the pros and cons of predictive models with rules and reveal their actual performance when
it comes to detecting phishing activities. The results showed that eDRI, a recently developed greedy
algorithm, not only generates useful models but that these models are also highly competitive with respect to
predictive accuracy and runtime when they are employed as anti-phishing tools.
Keywords- Classification, Data Mining, Machine Learning, Phishing, Rules, Rule based Classifiers,
Website Security
1. Introduction
Phishing normally involves creating a well-designed fake website that closely resembles an
existing trusted business website, aiming to trick users and illegally obtain their credentials such as
login information (Abdelhamid, 2015). Phishers intend to use users’ credentials so they can access
financial information including bank account numbers, credit card information, etc. (Afroz and
Greenstadt, 2011). Unfortunately, the consequences of phishing are severe for users because they become
vulnerable to identity theft and information breaches (Nguyen, et al., 2015). Phishing usually occurs
through an email that appears to come from a trusted source and urges online users to update their
login information by clicking a hyperlink (Khadi and Shinde, 2014).
Since phishing is a typical classification problem, machine learning (ML) techniques seem
appropriate to derive knowledge from websites’ features that can help in minimising this problem
(Thabtah and Abdelhamid, 2016) (Mohammad, et al., 2012) (Abdelhamid, et al., 2012). The key to success
in developing automated anti-phishing predictive systems is the website’s features. There are
tremendous numbers of features linked with a website so a necessary step that can enhance the
predictive system performance is to pre-process the features in order to pick up the “most” effective
ones (Thabtah et al., 2016B). Feature effectiveness can be measured using different computational
intelligence methods such as information gain (Quinlan, 1979), correlation analysis (Hall, 1999),
chi-square (Liu and Setiono, 1995), and others.
Fadi Thabtah
Nelson Marlborough Institute of Technology,
Auckland, New Zealand
Firuz Kamalov
Canadian University of Dubai
Dubai, UAE
Once an initial feature set is chosen, the classification algorithm can be applied to the selected
features to come up with the predictive system (Abdelhamid, et al., 2013) (Thabtah et al., 2012). There
are many ML algorithms for classification developed by scholars in the last two to three decades
(Witten, et al., 2016). Most of these algorithms use one of the following major classification approaches
in deriving the predictive systems:
1) Decision trees (ID3) (Quinlan, 1979) and its successors
2) Probabilistic models
3) Rule based classification
   a. Associative classification and Class Association Rule (Thabtah and Hammoud, 2013)
   b. Rule induction
   c. Covering or greedy classification (Thabtah, et al., 2011)
   d. Decision rules (C4.5-Rules) (Quinlan, 1993)
   e. Fuzzy Logic (FL) rules
4) Neural Networks (NN)
5) Support Vector Machine (SVM)
6) Others (Bagging, Self-Organization Map, Instance based learning, etc.)
Among the aforementioned classification approaches, this paper focuses on rule based
classification systems since we believe that these techniques are more suitable as anti-phishing tools.
The reasons behind our preference of rule based classifications are attributed to the following:
1) The content of the models derived is straightforward human knowledge that novice users can
easily understand and apply when necessary.
2) Often the knowledge is formed as “If-Then” rules in which the antecedent of the rule (If part)
consists of a conjunction of features values and the consequent (Then part) contains the target
attribute value (Website type). These rules are easy for users to control.
3) Rule based classification techniques have proven merits in accurately predicting target values
in many domains including medical diagnoses, stock market analysis, email classification, text
categorisation and others.
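The If-Then structure described in point 2 can be made concrete with a short sketch. The Python fragment below is illustrative only: the `Rule` class, the `classify` helper, the feature name `IP_Address` and the two example rules are hypothetical and are not taken from any of the cited systems. The class encoding (1 = legitimate, -1 = phishy) follows Table 1.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """An If-Then rule: a conjunction of feature values implying a class."""
    antecedent: dict   # feature name -> required value (the "If" part)
    consequent: int    # class label (the "Then" part): 1 legitimate, -1 phishy

    def covers(self, website: dict) -> bool:
        # The rule fires only if every condition in the conjunction holds.
        return all(website.get(f) == v for f, v in self.antecedent.items())


def classify(rules, website, default=1):
    """Return the class of the first rule that fires, else a default class."""
    for rule in rules:
        if rule.covers(website):
            return rule.consequent
    return default


# Hypothetical rules, written by hand purely for illustration.
rules = [
    Rule({"IP_Address": -1}, -1),
    Rule({"URL_Length": -1, "SSLfinal_State": -1}, -1),
]

suspect = {"IP_Address": 1, "URL_Length": -1, "SSLfinal_State": -1}
benign = {"IP_Address": 1, "URL_Length": 1, "SSLfinal_State": 1}
```

Because each rule is a plain conjunction, a user can read, edit or delete a single rule without retraining, which is the controllability argument made above.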
This paper critically analyses recent rule based classification studies related to the website phishing
problem. We show how these approaches derive predictive anti-phishing systems, along with their pros
and cons. Specifically, a comprehensive section that contrasts common rule based ML techniques is
included to highlight why this approach may serve as an intelligent anti-phishing tool. We believe
that limited attention has been paid to building anti-phishing models based on rules in ML and web security.
There are research attempts in associative classification such as (Abdelhamid, 2015) and (Abdelhamid
and Thabtah, 2014). However, covering and rule induction approaches are rarely applied to the website
phishing problem; exceptions include (Mohammad, et al., 2013A) and (Thabtah, et al., 2016A). Therefore, one of the
aims of this paper is to reveal the efficiency and effectiveness of approaches that generate models with rules
in combating phishing.
In the experimental analysis, a real phishing dataset consisting of over eleven thousand websites and
thirty features is utilised (Mohammad, et al., 2015). The websites have been collected using an online
scripting tool and the analysis primarily focuses on the detection rate, the number of rules induced and
the time taken to construct the predictive models. Moreover, we compare the performance of classifiers
that produce rules with other common classic classification algorithms, namely the probabilistic Bayes Net and
Simple Logistic. Lastly, the experiments also show the role of feature selection and its impact on the
phishing detection rate and the dimensionality of the dataset. The paper attempts to answer two questions: can
classifiers with rules deal effectively with the phishing classification problem? And if so, which learning
approach can potentially serve as an anti-phishing tool, and why?
This paper is structured as follows: Common predictive models that have been applied recently to
the phishing problem are critically analysed in Section 2 along with the phishing problem and the
phishing phases. Section 3 is devoted to experiments and results analysis on a real dataset using
six classifiers, four of which are rule based and two of which are classic algorithms.
Finally, the conclusions and recommendations for further research are given in Section 4.
2. Literature Review: Common Rule based Classifiers for Phishing
2.1 Phishing Problem and Its Life cycle
The website phishing problem usually contains a target attribute (class) with two values, since each website
can be either phishy or legitimate; this problem can therefore be considered a predictive classification task in
ML. In classification, a predictive model is constructed from labelled historical data in order to forecast a
target attribute in an unseen dataset. Similarly, in phishing, a number of website features can be obtained
from different sources such as Phishtank and then pre-processed. Once the data is pre-processed, an intelligent
algorithm based on data mining or ML can be applied to it in order to derive the anti-phishing
model (classifier). This classifier is then integrated into a web browser and utilised to predict the type of any
website browsed by the user. The browsed website is basically the test data example. In most related
studies, the phishing problem was treated as a binary classification where the class has only two values, i.e.
phishy or legitimate. However, recent research studies such as (Abdelhamid, et al., 2014) dealt with
phishing as a multi-class problem by adding a new class, i.e. suspicious.
Phishing incidents often occur through emails sent to online users urging them to update their
information (Mohammad et al., 2013B). There are other ways to initiate a phishing attack including
online blogs, Instant Messages (IMs), peer-to-peer file sharing services, online forums, social media
websites among others (Abdelhamid, 2015). Figure 1 depicts the process of starting a phishing attack
and the phases involved:
1) A hyperlink in an email or IM is sent along with a text message to the user urging them to
update their account information.
2) When the user clicks on the hyperlink, they are redirected to a fake website that replicates an
authentic website the user normally encounters.
3) When the user tries to log in, they become vulnerable to phishing and their credentials are
sent to a key logger or a server.
4) Once the phishers obtain the user’s credentials, that user becomes subject to online fraud.
Figure 1. Phishing steps (Abdelhamid et al., 2014). [Flow diagram: Phisher initializes attack; Phisher creates phishy email; Phisher sends phishy email; User receives phishy email; User discloses information; Phisher: defrauding successful. The steps span the Early, Mid and Post Phishing Phases, with an anti-phishing tool that detects and prevents the attack.]
2.2 Decision Tree Rules
Decision trees are a classification approach that became known after the dissemination of the ID3 algorithm
(Quinlan, 1979). A decision tree can be constructed based on the available variables inside the training
dataset using Information Gain (IG) (Equation 1). The tree is built by initially selecting the variable
with the highest computed IG among all available variables in the training dataset as the root node. The
IG is computed using the Entropy (Equation 2) and denotes how informative a variable is in splitting
the training instances according to the target variable (class). The more purely a variable splits the training
instances with reference to the class values, the higher the score. After choosing the root node, the
algorithm repeatedly computes the IG for the remaining variables, excluding the root, until the tree
cannot be divided any further or all training instances in a branch belong to one class. A tree
constructed by ID3 can be transformed into a rule set in which each path linking the root node
to a leaf makes a rule.
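$$\mathrm{Gain}(T, F) = \mathrm{Entropy}(T) - \sum_{f} \frac{|T_f|}{|T|} \, \mathrm{Entropy}(T_f) \qquad (1)$$
where
$$\mathrm{Entropy}(T) = -\sum_{c} P_c \log_2 P_c \qquad (2)$$
where
P_c = probability that an instance in T belongs to class c,
T_f = subset of T for which feature F has value f (the sum in Equation 1 runs over all values f of F),
|T_f| = number of examples in T_f, and |T| = size of T.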
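Equations 1 and 2 can be checked numerically with a few lines of code. The following is a minimal sketch in Python, not the ID3 implementation used by any cited work; the function names and the four-row toy dataset are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Equation 2: -sum over classes of P_c * log2(P_c)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Equation 1: Entropy(T) minus the size-weighted entropy of each T_f."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        # Group the class labels by the value f that the feature takes.
        subsets.setdefault(row[feature], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Toy data (made up): "https" separates the classes perfectly, "at_symbol" not at all.
rows = [
    {"https": 1, "at_symbol": 1},
    {"https": 1, "at_symbol": 1},
    {"https": -1, "at_symbol": 1},
    {"https": -1, "at_symbol": 1},
]
labels = [1, 1, -1, -1]
```

On this toy data the perfectly separating feature obtains the maximum gain for a balanced two-class problem, while the constant feature obtains none, so ID3 would pick the former as the root node.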
(Fette, et al., 2007) investigated phishing utilising C4.5-Rules algorithm (Quinlan, 1993) and then
compared the results obtained with classic ML techniques including Random Forest, SVM and Naïve
Bayes. Experiments on a set of 860 phishy and 695 ham emails were conducted. Various features for
distinguishing phishing emails have been identified, i.e. IP URLs, time of space, HTML messages,
number of connections inside the email, “JavaScript" and others. The authors reasoned that ML
methods can be improved with respect to predictive performance when grouping identified variables
inside the resulting classifiers.
2.3 Rule Induction
Covering methods such as Dynamic Rule Induction (DRI) (Qabajeh, et al., 2015), Enhanced Dynamic
Rule Induction (eDRI) (Thabtah, et al., 2016A) and PRISM (Cendrowska, 1987) derive the rules one by
one from the training dataset. Normally these algorithms divide the training dataset into subsets in
which each subset represents a target attribute value (class). The instances in one subset are considered
positive for the subset’s class and negative for all other class labels. The covering algorithm is then
applied to each subset, starting from an empty rule, i.e. “If Empty then class1”. The algorithm
keeps adding attribute values to the current rule, normally using a statistical measure such as the rule’s
expected accuracy (Equation 3). Building the current rule stops according to a certain condition. For
instance, the PRISM algorithm stops building the rule when the rule’s expected accuracy cannot be further
improved, e.g. when it reaches 100%, whereas DRI stops building the rule when its frequency reaches a
predefined threshold called Rule_freq. Once a rule is constructed, all of the training data instances
covered by it are removed. The covering algorithm then starts building a new rule from the same
subset until that subset becomes empty, then moves on to the subset of the next class,
and so on. The covering algorithm usually terminates when the training
dataset has no more data instances or when no rules with acceptable accuracy can be found.
$$\text{Expected accuracy}(r) = \frac{P}{T} \qquad (3)$$
where P = the number of positive instances covered by rule r (both antecedent and consequent), and
T = the total number of instances covered by r’s antecedent.
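The covering loop described above can be sketched as follows. This is a simplified, PRISM-style illustration written for this discussion, not the published PRISM, DRI or eDRI code: the function names, the tie-breaking on the number of positives, the `min_accuracy` parameter and the toy data are all assumptions.

```python
def covers(antecedent, row):
    return all(row[f] == v for f, v in antecedent.items())


def expected_accuracy(rows, labels, antecedent, target):
    """Equation 3: returns (P / T, P) for the rule 'antecedent -> target'."""
    covered = [l for r, l in zip(rows, labels) if covers(antecedent, r)]
    positives = sum(1 for l in covered if l == target)
    return (positives / len(covered) if covered else 0.0, positives)


def learn_one_rule(rows, labels, target):
    """Grow one rule from the empty antecedent until accuracy stops improving."""
    antecedent = {}
    while True:
        accuracy, _ = expected_accuracy(rows, labels, antecedent, target)
        if accuracy == 1.0:
            return antecedent
        best = None
        for feature in sorted(set(rows[0]) - set(antecedent)):
            for value in sorted({row[feature] for row in rows}):
                candidate = {**antecedent, feature: value}
                score = expected_accuracy(rows, labels, candidate, target)
                # Prefer higher accuracy, then more covered positives.
                if score[1] > 0 and (best is None or score > best[0]):
                    best = (score, candidate)
        if best is None or best[0][0] <= accuracy:
            return antecedent
        antecedent = best[1]


def covering(rows, labels, min_accuracy=1.0):
    """Learn rules class by class, removing covered instances after each rule."""
    rules = []
    for target in sorted(set(labels)):
        remaining = list(zip(rows, labels))  # each class starts from the full data
        while any(l == target for _, l in remaining):
            r_rows = [r for r, _ in remaining]
            r_labels = [l for _, l in remaining]
            antecedent = learn_one_rule(r_rows, r_labels, target)
            accuracy, _ = expected_accuracy(r_rows, r_labels, antecedent, target)
            if not antecedent or accuracy < min_accuracy:
                break  # no acceptable rule can be found for this class
            rules.append((antecedent, target))
            remaining = [(r, l) for r, l in remaining if not covers(antecedent, r)]
    return rules


# Toy data (made up): a website is phishy (-1) exactly when "ip" is -1.
rows = [{"ip": -1, "https": 1}, {"ip": -1, "https": -1},
        {"ip": 1, "https": 1}, {"ip": 1, "https": 1}]
labels = [-1, -1, 1, 1]
learned = covering(rows, labels)
```

A threshold-based variant in the spirit of DRI would replace the `accuracy == 1.0` stopping test with a check on the rule's frequency; that variant is not shown here.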
One of the latest covering algorithms applied as an anti-phishing tool is eDRI
(Thabtah, et al., 2016A). eDRI is a dynamic algorithm that mines datasets using two main
thresholds named frequency and Rule_Strength. This algorithm scans the training dataset and only
stores “strong” feature values, i.e. those with frequencies larger than the minimum frequency threshold.
These strong feature values are the only ones that can be part of a rule; all others are removed
during the first data scan. Once a rule is discovered, eDRI not only removes its training instances but
also updates the frequencies of the strong feature values to reflect that deletion, so the algorithm performs a
natural pruning that often leads to controllable predictive systems. eDRI was applied to a real binary
phishing dataset that contains websites collected from different sources. The results obtained from
eDRI were highly competitive with respect to the phishing detection rate. This covering algorithm was
able to outperform other types of classification systems generated by decision trees and associative
classification. Moreover, the classification systems generated by eDRI have shown interesting
knowledge that can help users and security experts to design an information security strategy.
Rule induction is a more sophisticated classification approach than covering that adds extensive
pruning to further reduce the size of the resulting classification systems (Abdelhamid and Thabtah,
2014) (Thabtah et al., 2011). One of the common rule induction approaches is RIPPER (Cohen, 1995),
which is similar to PRISM in the way it splits the input dataset. However, RIPPER invokes a rule
growing phase. In this phase, the algorithm starts with data instances that belong to the
minority class (the least frequent class in the training dataset) and constructs a rule by appending feature
values to the rule’s body until the rule error becomes as small as possible, ideally zero. While the rule is
being constructed, RIPPER employs a special pruning procedure to eliminate unnecessary feature values
that have no significance when appended to the current rule’s body. RIPPER terminates the rule
building process when the rule’s error reaches 50% or according to the minimum description length (MDL)
principle (Hall et al., 2009). A rule is finalised when adding a feature value to its body generates an error
larger than the error obtained before adding that feature value.
(Khadi and Shinde, 2014) studied the problem of email phishing and proposed a potential solution
based on combining the RIPPER classifier with fuzzy logic (FL). The role of FL is to pick the main
features of the email and rank them based on a probability score. The role of RIPPER, on the other hand,
is to automatically use these features to classify emails as ham or phishy. Two components
of the email have been used, i.e. the email message (spelling errors, embedded link) and the URL (IP address,
Length, Long URL, Suffix_Prefix, Crawler URL, Non matching URL). Moreover, a very limited dataset
consisting of 100 instances from Phishtank has been used in the experiments with the WEKA software tool
(Hall et al., 2009). No comparison with other FL or rule based classification methods was conducted by the
authors. The results showed that RIPPER generated twelve rules from the dataset with an
85.4% prediction rate.
(Aburrous, et al., 2010) investigated data mining methods to assess their applicability in categorising
websites based on phishing features. The website’s features were manually classified into six criteria as
described in an earlier published work on phishing (Aburrous, et al., 2008). Then, using WEKA, a
number of experiments with four classification algorithms were conducted on a small dataset
of 1006 instances downloaded from Phishtank. The basis of the comparison is the classification accuracy
of the classifiers produced. The results revealed that the CBA algorithm (Liu et al., 1998) is a promising
classifier that was able to detect on average 83% of phishing websites. The authors suggested that
the results obtained can be further enhanced when careful feature selection is employed.
2.4 Associative Classification
Associative classification is an approach that employs association rule discovery methods to generate
classifiers from datasets (Thabtah et al., 2015) (Mohammad, et al., 2015). This approach normally relies
on two user-defined thresholds: the minimum support and the minimum confidence. The support of a feature
value is basically how many times that feature value has appeared in the training dataset, whereas the
confidence of that feature value is the frequency of the feature value together with the target class,
relative to the frequency of the feature value alone.
Any feature value with a frequency larger than the minimum support becomes a frequent feature value. The
aim of the associative classification algorithm is to discover all frequent feature values and from those
build the classification system, in which any frequent feature value with confidence larger than the
minimum confidence becomes a rule. One advantage of associative classification is the low error rate of the
classification systems derived, but a noticeable drawback is the very large number of rules derived,
which limits its usage in real applications.
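The two thresholds can be illustrated with a minimal sketch restricted to single feature values. This is not CBA or MCAR: real associative classifiers also combine several feature values and rank and prune the resulting rules. The function names, the use of `>=` for the threshold tests and the toy data are assumptions; support is counted as the text above defines it (frequency of the feature value), whereas some algorithms count the feature value and class together.

```python
def support(rows, feature, value):
    """Number of training instances in which the feature takes this value."""
    return sum(1 for r in rows if r[feature] == value)


def confidence(rows, labels, feature, value, target):
    """Share of instances with this feature value that belong to the target class."""
    matches = [l for r, l in zip(rows, labels) if r[feature] == value]
    if not matches:
        return 0.0
    return sum(1 for l in matches if l == target) / len(matches)


def single_value_rules(rows, labels, min_support, min_confidence):
    """Keep every frequent feature value whose confidence passes the threshold."""
    rules = []
    for feature in sorted(rows[0]):
        for value in sorted({r[feature] for r in rows}):
            if support(rows, feature, value) < min_support:
                continue  # not frequent, so it cannot become a rule
            for target in sorted(set(labels)):
                conf = confidence(rows, labels, feature, value, target)
                if conf >= min_confidence:
                    rules.append((feature, value, target, conf))
    return rules


# Toy data (made up): five websites described by a single feature.
rows = [{"https": -1}, {"https": -1}, {"https": -1}, {"https": 1}, {"https": 1}]
labels = [-1, -1, 1, 1, 1]
found = single_value_rules(rows, labels, min_support=2, min_confidence=0.6)
```

Lowering either threshold admits more rules, which is the source of the large rule sets noted above.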
Two associative classification methods named CBA and MCAR have been evaluated on a Phishtank
dataset to assess their applicability in combating phishing (Aburrous et al., 2010). The authors used a
dataset consisting of over 1000 instances with 27 different features and applied CBA and four other
classifiers using the WEKA tool. The aim was to assist security managers within organisations in
building an intelligent anti-phishing tool within browsers that can detect phishing as accurately as
possible. Experimental results of the six ML algorithms revealed that associative classification methods
generate more rules than the rest of the algorithms yet produce classifiers with higher predictive accuracy. More specifically,
the associative classification systems produced showed high correlations among features linked with
three major criteria (URL, Domain Identity, Encryption). Nevertheless, the massive number of rules
derived by CBA may overwhelm end-users since they will not be able to control the anti-phishing
system. Further, the authors have not implemented the associative classification rules within a browser
to evaluate their real performance, and thus it is hard to measure the success or failure of their
classification systems.
Recently, more domain specific associative classification anti-phishing systems were created by
(Abdelhamid, et al., 2014) (Abdelhamid, 2015). The new models take into account not only the two class
values of the phishing problem (legitimate, phishy) but also a harder case to detect, the
“suspicious” class label. Instances that are neither fully phishy nor fully legitimate are very hard for
typical ML algorithms to detect and thus increase their false positive rates. Hence these authors enhanced
current intelligent classification systems with two distinct advantages:
1) Extending the phishing problem to include suspicious cases, which makes it more realistic.
2) Proposing a new multi-label learning phase that can discover disjunctive rules besides the
conjunctive rules. These additional disjunctive rules are tossed out by existing associative
classification methods. The new multi-label phase enhances predictive power and provides
more useful knowledge to the end-user.
The authors have used a dataset that has sixteen features and over 2000 instances and compared the
performance of their classifiers with other traditional classifiers with respect to knowledge derived and
accuracy. Before experimentation, the authors employed a chi-square testing method to measure
the goodness of the features and to discriminate among features with respect to their impact on phishing.
Results on the processed data showed highly competitive performance of the new multi-label associative
classifiers when compared with other existing algorithms such as CBA and decision trees.
3. Experimental Results
3.1 Experimental Settings and Data
The experiments have been performed using a recently published security dataset that belongs to the
author. The dataset has been published at the University of California, Irvine (UCI) data repository (Litchman, 2013) and
consists of thirty website features and the target variable (class). The security dataset includes
more than 11000 examples in which each example represents a website that can be either legitimate or
phishy; the majority of the features are either binary or multi-valued, i.e. ternary.
The website features in the dataset were extracted using a PHP script that was embedded in
the web browser and applied to Phishtank and legitimate websites between the middle of 2013 and
late 2015. The data instances have been collected from the Yahoo directory and Phishtank repositories.
Before collecting the websites, and for each feature, a hand crafted rule was designed and then coded
in PHP. For example, for the IP Address, a rule was designed that maps a website to the phishy class when
an IP address appears in the URL. Another rule examines the URL length and assigns a website phishy
status when the URL length exceeds 75 characters. Further details on the complete description of features
and their hand crafted rules can be found in (Mohammad, et al., 2015).
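The two hand crafted rules mentioned above can be sketched as follows. The original extraction script was written in PHP and is not reproduced here; this Python version is an approximation whose function names and parsing approach are assumptions. Only the 75-character threshold and the 1 / -1 encoding come from the text, so the intermediate "suspicious" (0) value of the URL length feature is deliberately left out.

```python
import ipaddress
from urllib.parse import urlsplit


def ip_address_feature(url):
    """Return -1 (phishy) if the URL's host is a literal IP address, else 1."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return -1
    except ValueError:
        # Not a plain IPv4/IPv6 literal; obfuscated forms such as
        # hexadecimal-encoded addresses are NOT detected by this sketch.
        return 1


def url_length_feature(url, phishy_above=75):
    """Return -1 (phishy) if the URL is longer than the threshold, else 1."""
    return -1 if len(url) > phishy_above else 1


short_url = "https://www.example.com/login"
ip_url = "http://125.98.3.123/fake.html"
long_url = "http://example.com/" + "a" * 80
```

Each function maps one raw website property to the discrete value stored in the dataset, which is why the resulting features are binary or ternary rather than continuous.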
Table 1 shows twelve sample websites using only eight features of the dataset for presentation
purposes. For instance, in the first column (IP Address) a website is given 1 for this feature if no
IP address appears in the URL; otherwise -1 is assigned. The possible values assigned to the
features have been adopted from the handcrafted rules developed in (Mohammad, et al., 2015). The
possible values (-1, 1, 0) correspond to phishy, legitimate and suspicious respectively. In Table 1, three
features are linked with ternary values, i.e. (URL_Length, URL_of_Anchor, Links in Tags) and the rest
are assigned binary values (1, -1). The last column in the table represents the class: of the twelve
sample websites, four are legitimate and eight are phishy. The class values
have been assigned based on the possible features’ values per website and using the hand crafted
feature rules.
We have used the WEKA (Hall et al., 2009) ML tool to run the experiments. WEKA is short for
Waikato Environment for Knowledge Analysis and it is a free ML Java platform that contains various
implementations of knowledge discovery algorithms, filtering methods and visualisation
techniques. To ensure the validity of the predictive models’ results, ten-fold cross-validation was utilised
during the training phase of all classification algorithms considered in the experiments. The computing
machine used during the experiments has a 2.5 GHz processor.
Table 1. Sample of twelve websites from the dataset for eight features (Class: 1 = legitimate, -1 = phishy)
IP address | URL Length | Shortening Service | At symbol | HTTPS | Request URL | URL of Anchor | Links in Tags | Class
-1 | 1 | 1 | 1 | -1 | 1 | -1 | 1 | -1
1 | 1 | 1 | 1 | -1 | 1 | 0 | -1 | -1
1 | 0 | 1 | 1 | -1 | 1 | 0 | -1 | -1
1 | 0 | 1 | 1 | -1 | -1 | 0 | 0 | -1
1 | 0 | -1 | 1 | 1 | 1 | 0 | 0 | 1
-1 | 0 | -1 | 1 | -1 | 1 | 0 | 0 | 1
1 | 0 | -1 | 1 | 1 | -1 | -1 | 0 | -1
1 | 0 | 1 | 1 | -1 | -1 | 0 | -1 | -1
1 | 0 | -1 | 1 | -1 | 1 | 0 | 1 | 1
1 | 1 | -1 | 1 | 1 | 1 | 0 | 1 | -1
1 | 1 | 1 | 1 | 1 | -1 | 0 | 0 | 1
1 | 1 | -1 | 1 | 1 | 1 | -1 | -1 | -1
A number of rule based classification algorithms have been chosen to measure the effectiveness of
this family of algorithms in detecting phishing. In particular, eDRI (Thabtah, et al., 2016A), RIPPER
(Cohen, 1995), C4.5-Rules (Quinlan, 1993), and RIDOR (Gaines and Compton, 1995) have been chosen.
We evaluate the predictive power as well as the predictive model content on the problem
of website phishing classification. We also compared rule based classification with known classic
non-rule based ML algorithms. To be exact, we used the probabilistic Bayes Net (Bouckaert, 2004) and Simple
Logistic (Sumner, et al., 2005). The bases of the comparison are:
1) Predictive power, measured using the one-error rate
2) Runtime, measured using the time taken to construct the predictive models
3) Predictive model content for decision makers, at least for rule based classifiers
Finally, two major experiments have been conducted: one that does not consider feature filtering and
one with feature filtering. We decided to use the Correlation-based Feature Selection (CFS) filtering method (Hall,
1999) because it favours feature subsets with low feature-to-feature correlation and high feature-to-class correlation.
As a matter of fact, CFS usually reduces the dimensionality of the input dataset significantly when
compared with other known filtering and wrapping methods in the literature such as Information Gain
and Chi Square, without drastically hindering the predictive model’s performance.
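The trade-off that CFS makes can be illustrated with its subset merit score, which for a subset of k features is k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the average feature-to-class correlation and r_ff the average feature-to-feature correlation. The sketch below is an illustration rather than the WEKA implementation used in the experiments: the absolute Pearson correlation is used as a stand-in correlation measure (the measure in the original method may differ), the subset search is omitted, and the function names and toy columns are assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def cfs_merit(columns, labels, subset):
    """Merit of a feature subset: high class correlation, low redundancy."""
    k = len(subset)
    r_cf = sum(abs(pearson(columns[f], labels)) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = (sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Toy columns (made up): "a_copy" duplicates "a"; "noise" is unrelated to the class.
columns = {
    "a": [1, 1, -1, -1],
    "a_copy": [1, 1, -1, -1],
    "noise": [1, -1, 1, -1],
}
labels = [1, 1, -1, -1]
```

On this toy data, adding the redundant copy does not raise the merit and adding the uncorrelated feature lowers it, so a search guided by this score would tend to keep the single informative feature.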
3.2 Results Analysis
Figures 2a & 2b show two important evaluation measures: the error rate generated by the rule based
predictive models and the time taken to build them in milliseconds (ms). It is clear from Figure 2a
that C4.5-Rules models have outperformed the remaining classifiers. In particular, C4.5-Rules achieved
0.86%, 3.03%, and 3.33% higher accuracy than the RIPPER, RIDOR and eDRI algorithms
respectively. Nevertheless, C4.5-Rules models are very large with respect to the number of rules
derived when compared with the remaining rule based classifiers, which may limit their use in real
domain applications including phishing (see Figure 3). As a matter of fact, C4.5-Rules discovered far
more rules than its counterparts. To be exact, there were 140, 130 and 144 additional rules generated by
C4.5-Rules when compared with RIPPER, eDRI and RIDOR respectively. This additional knowledge
may contribute to increasing predictive performance but resulted in uncontrollable classifiers
that may overwhelm end-users. Overall, the classification algorithms that derived rule based classifiers
were able to produce predictive models with good accuracy and maintainable sets of rules, except C4.5-Rules.
Figure 2a. Error rate (%) of the rule based predictive models of the considered algorithms (C4.5-Rules, RIPPER, eDRI, RIDOR).
Figure 2b. Run time in ms required to build the rule based predictive models of the considered algorithms (C4.5-Rules, RIPPER, eDRI, RIDOR).
The fastest algorithm in building predictive models among the rule based classifiers was eDRI. This
algorithm has a smart rule induction strategy that removes the data instances connected with
each generated rule on the fly and then amends the ranks of the remaining potential rules.
Amending the rank of the rules based on their frequencies whenever a rule is generated allows
weak rules to be identified before the rule evaluation phase starts, which consequently minimises the
use of computing resources. eDRI derives models quicker than most of the other rule based classifiers,
as is evident from Figure 2b, which makes this algorithm suitable as an anti-phishing method.
An additional reason for the suitability of eDRI is that it produced a compact set of rules, far
smaller than that of C4.5-Rules (see Figure 3), yet maintained acceptable predictive power. A slight decrement
in accuracy in exchange for a maintainable set of rules can be tolerated.
To seek further advantages and disadvantages of rule based classifiers for phishing detection, we
have compared the four algorithms with common classic predictive classifiers, namely the probabilistic
Bayes Net (Bouckaert, 2004) and Simple Logistic (Sumner, et al., 2005). Since the latter two classifiers
do not produce rules, the bases of the comparison were the error rate and the time taken to construct the
predictive models in ms.
Figures 4a & 4b depict the error rate and the runtime in ms for the six classification algorithms. The
results indicate that rule based classifiers are very competitive in predicting phishing websites
since RIPPER and C4.5-Rules were the top two classifiers with regard to predictive accuracy. As a
matter of fact, when calculating the average error rate of the four rule based classifiers (C4.5-Rules,
eDRI, RIPPER, RIDOR) and comparing it with each classic classifier’s error rate, the rule based classifiers
still dominate. The average error rate of the rule based classifiers is 5.93%, which is lower than those
of Bayes Net and Simple Logistic by 1.08% and 0.47% respectively. This is a direct, if limited, indication
that classifiers with rules are not only beneficial for decision makers but can also achieve a higher
predictive performance than traditional classification algorithms such as Bayes Net and Simple
Logistic, at least for phishing detection of websites.
Figure 4b illustrates the runtime in ms taken to build the predictive models by the six classifiers. The
figure indicates that not only are rule based classifiers more predictive than the other classification
algorithms but they are also very competitive with respect to the time taken to build the classifiers. The
fastest algorithms are eDRI and Bayes Net. Bayes Net utilises simple calculations of the likelihood of
each class in the dataset based on probability theory, and there is no pruning or rule discovery
involved. Hence, it is fast when deriving predictive models, which explains the minimal time taken to
build its anti-phishing models in Figure 4b.
Figure 3 The number of rules generated by the rule based predictive models (C4.5-Rules, RIPPER, eDRI, Ridor)
We have looked into the features that have a high influence on the target class in order to seek a
more concise rule based model for phishing. To this end, we employed CFS filtering (Hall, 1999) to
remove redundant features in the early stages before building the models. The most effective phishing
features that CFS chose are (Prefix_Suffix, having_Sub_Domain, SSLfinal_State, Request_URL,
URL_of_Anchor, Links_in_tags, SFH, web_traffic, Google_Index). All other features were discarded
before we ran the classifiers on the aforementioned features. Figure 5 demonstrates that despite the
dimensionality reduction of the dataset, in which 21 features were removed, the predictive performance
is sustained and drops by less than 1%. Moreover, rule based classifiers maintained good predictive
performance when compared with classic classifiers after pre-processing the phishing dataset. The
average error of the rule based classifiers (C4.5-Rules, eDRI, RIPPER, RIDOR) was slightly lower than
that of Bayes Net and Simple Logistics. Overall, feature selection has significantly cut down the number
of unnecessary features without drastically increasing the error rate, as is clear from Figure 5.
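To make the filtering step concrete, the sketch below shows the merit heuristic commonly associated with correlation-based feature selection, together with a helper that projects a record onto the nine retained features. The merit formula is the standard formulation from the CFS literature. The correlation inputs are placeholders, since this section does not state how they were estimated, and the names `cfs_merit` and `project` are illustrative.

```python
import math

# The nine features CFS retained, as listed in the text.
CFS_SELECTED = [
    "Prefix_Suffix", "having_Sub_Domain", "SSLfinal_State",
    "Request_URL", "URL_of_Anchor", "Links_in_tags",
    "SFH", "web_traffic", "Google_Index",
]
TOTAL_FEATURES = 30  # the full dataset has thirty features


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit of a k-feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff).

    High correlation with the class raises the merit; high correlation
    among the features themselves (redundancy) lowers it.
    """
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator


def project(record, features=CFS_SELECTED):
    """Keep only the selected features of a record (a dict)."""
    return {name: record[name] for name in features if name in record}


removed = TOTAL_FEATURES - len(CFS_SELECTED)
print("Features removed by filtering:", removed)
```

The heuristic rewards subsets whose features each correlate with the class but not with one another, which is why redundant features can be dropped with little loss in accuracy.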
Finally, we investigated the content of the classifiers generated by C4.5-Rules, RIPPER and eDRI to seek
common useful knowledge for end-users. Based on the predictive models of these algorithms, it seems
that the rule based algorithms induce rules in different ways, which is reflected in their
classifiers’ content. For instance, RIPPER tends to focus on the phishing class (class = -1) during rule
induction and hence generates rules only for that class label; when none of these rules can be
applied, RIPPER assigns class = 1, i.e. “legitimate”, to the test data. On the other hand, the RIDOR
algorithm discovers rules that belong only to legitimate websites, and when none of these rules can be
applied to a test data case it assigns class = -1. Thus, we think that the rules in both the RIPPER and
RIDOR classifiers are imbalanced with respect to class label, and therefore the knowledge derived by
both algorithms on the phishing problem is inadequate. In contrast, the eDRI algorithm derived an
anti-phishing classifier that contains nine rules for the phishing class and sixteen rules for the
legitimate class. Hence, both class labels are represented to a certain degree within the eDRI classifier
and the user can distinguish between the classes based on the content of the classifier. Moreover, the
number of rules in eDRI is smaller than in RIDOR and RIPPER, and thus maintaining the eDRI rules is
easier for the end-user. Overall, we found that the rules below are substantial for the end-user in
distinguishing phishing websites:
If URL_of_Anchor = -1 and SSL_Final_state = -1 Then website = -1
If URL_of_Anchor = -1 and SSL_Final_state = 0 Then website = -1
If Prefix_Suffix = 1 Then website = 1
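The way such If-Then rules and a default class interact can be sketched as a small decision list. The three rules are those listed above, with feature names spelled as in the rules. First-match ordering, the `classify` helper and the choice of default class are assumptions made for illustration, not a description of any particular algorithm's implementation.

```python
# Each rule is (conditions, class label); -1 = phishing, 1 = legitimate.
RULES = [
    ({"URL_of_Anchor": -1, "SSL_Final_state": -1}, -1),
    ({"URL_of_Anchor": -1, "SSL_Final_state": 0}, -1),
    ({"Prefix_Suffix": 1}, 1),
]


def classify(record, rules=RULES, default_class=1):
    """Return the label of the first rule whose conditions all hold.

    ASSUMPTION: rules are tried in order; if none fires, the default
    class is assigned, as described for RIPPER (default 1) and
    RIDOR (default -1).
    """
    for conditions, label in rules:
        if all(record.get(feature) == value for feature, value in conditions.items()):
            return label
    return default_class


# A rule set that covers only the phishing class, in the style described for RIPPER.
phishing_only_rules = RULES[:2]

site = {"URL_of_Anchor": -1, "SSL_Final_state": 0, "Prefix_Suffix": -1}
print(classify(site))                                    # second rule fires
print(classify({"URL_of_Anchor": 1}, phishing_only_rules))  # falls to default
```

With a one-class rule set, every prediction for the other class comes from the default, which is why such a classifier tells the end-user little about that class.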
Figure 4a Error rate (%) of all predictive models of the considered algorithms (C4.5-Rules, RIPPER, eDRI, Ridor, Bayes Net, SimpleLogistic)
Figure 4b Runtime in ms required to build all predictive models of the considered algorithms (C4.5-Rules, RIPPER, eDRI, Ridor, Bayes Net, SimpleLogistic)
4. Conclusions
Phishing normally involves creating a fake website that closely resembles an existing trusted
business website and aims to trick users and access their financial assets. One way to minimise the risks
associated with phishing is to build automated predictive models using rule based classification
techniques, since models that contain simple If-Then knowledge are favoured by end users in
understanding and controlling this serious web threat. Hence, this paper critically analysed and then
thoroughly experimented with a number of rule based classifiers and other non-rule based algorithms
on the phishing detection problem. In particular, we have evaluated the rule based classification
approach and its applicability to website phishing using a real dataset. This dataset was recently
published in the University of California, Irvine (UCI) data repository. To achieve this aim, four rule
based classifiers that belong to different families of algorithms have been used (C4.5-Rules, RIPPER,
eDRI, RIDOR) along with two non-rule classic classification algorithms (Probabilistic-Bayes Net, Simple
Logistic). The bases of comparison were detection error rate, time taken to build the predictive models
in ms and the content of the models. We have also taken feature filtering into consideration and its
effect on phishing detection rate.
The experimental results on a large phishing websites dataset revealed that rule based classifiers are
highly useful anti-phishing techniques since they derived moderate size models without hindering
predictive accuracy. In particular, the average error of the rule based classifiers, especially
eDRI, was smaller than that of each classic classification algorithm. Moreover, eDRI was the fastest
method to generate predictive models among the rule based classifiers, and this algorithm was highly
competitive with Bayes Net. More importantly, when correlation-based feature selection (CFS) filtering
was employed eDRI was barely impacted, and the algorithm was able to produce anti-phishing models with
less than a 1% increase in error rate by using just nine features instead of thirty. Overall, all rule based
predictive models performed well with respect to detection rate and runtime, which indicates the
suitability of these approaches for detecting phishing. In fact, the rules discovered by the eDRI and
RIPPER algorithms are influential in differentiating between websites since they can serve as decision
tools for end-users to fight phishing.
In the near future we intend to design rule pruning methods to further reduce the number of rules
derived by rule based predictive models.
Figure 5 Error rate (%) derived by the considered algorithms after the CFS filtering method was applied (C4.5-Rules, RIPPER, eDRI, Ridor, Bayes Net, Simple Logistic)
References
1. Abdelhamid N. (2015) Multi-label rules for phishing classification. Applied Computing and
Informatics 11 (1), 29-46.
2. Abdelhamid N., Thabtah F., (2014) Associative Classification Approaches: Review and
Comparison. Journal of Information and Knowledge Management (JIKM). Vol. 13, No. 3 (2014)
1450027.
3. Abdelhamid N., Thabtah F., Ayesh A. (2014) Phishing detection based associative classification
data mining. Expert systems with Applications Journal. 41 (2014) 5948–5959.
4. Abdelhamid N, Ayesh A., Thabtah F. (2013) Phishing Detection using Associative Classification
Data Mining. ICAI'13 - The 2013 International Conference on Artificial Intelligence, pp. (491-499).
USA.
5. Abdelhamid N., Ayesh A., Thabtah F. (2012) An Experimental Study of Three Different Rule
Ranking Formulas in Associative Classification Mining. Proceedings of the 7th IEEE International
Conference for Internet Technology and Secured Transactions (ICITST-2012), pp. (795-800), UK.
6. Aburrous M., Hossain M., Dahal K.P. and Thabtah F. (2010) Associative Classification techniques
for predicting e-banking phishing websites. Proceedings of the 2010 International Conference on
Information Technology, Las Vegas, Nevada, USA, 2010, pp. 176-181.
7. Aburrous M., Hossain A., Dahal K., Thabtah F. (2008) Intelligent Quality Performance Assessment
for E-Banking Security using Fuzzy Logic. Proceedings of the 7th IEEE International Conference on
Information Technology (ITNG 2008). Las Vegas, USA.
8. Afroz, S., & Greenstadt, R. (2011) PhishZoo: Detecting Phishing Websites by Looking at Them. In Fifth
International Conference on Semantic Computing (September 18- September 21). Palo Alto,
California USA, 2011. IEEE.
9. Bouckaert, R. R., (2004). Bayesian network classifiers in Weka. (Working paper series. University of
Waikato, Department of Computer Science. No. 14/2004). Hamilton, New Zealand: University of
Waikato.
10. Cendrowska, J. (1987) PRISM: An algorithm for inducing modular rules. International Journal of
Man-Machine Studies, Vol.27, No.4, 349-370.
11. Cohen, W.W. (1995) Fast Effective Rule Induction. In Proceedings of the Twelfth International
Conference on Machine Learning. Tahoe City, California, 1995. Morgan Kaufmann.
12. Fette I., Sadeh N., Tomasic A. (2007) Learning to detect phishing emails. Proceedings of the 16th
international conference on World Wide Web. 649-656.
13. Gaines, B.R., Compton, P. (1995) Induction of Ripple-Down Rules Applied to Modeling Large
Databases. Journal of Intelligent Information Systems 5(3):211-228.
14. Hall M., Frank E., Holmes G., Pfahringer B., Reutemann P., Witten I. (2009) The WEKA Data Mining
Software: An Update; SIGKDD Explorations, Volume 11, Issue 1.
15. Hall M. (1999) Correlation-based Feature Selection for Machine Learning. Thesis, Department of
Computer Science, University of Waikato, New Zealand.
16. Khadi A., Shinde S. (2014) Detection of phishing websites using data mining techniques.
International Journal of Engineering Research and Technology, Volume 2(12).
17. Liu, H. and Setiono, R. (1995) Chi2: Feature Selection and Discretization of Numeric Attribute.
Proceedings of the Seventh IEEE International Conference on Tools with Artificial Intelligence,
November 5-8, 1995, pp. 388.
18. Liu, B.; Hsu, W.; and Ma, Y. (1998) Integrating classification and association rule mining. In Proc.
1998 Int. Conf. Knowledge Discovery and Data Mining (KDD’98), 80–86.
19. Mohammad R., Thabtah F., and McCluskey L. (2013A) Intelligent Rule based Phishing Websites
Classification. IET Information Security, vol. 8, no. 3, pp. 153-160, July 2013-A.
20. Mohammad R., Thabtah F., and McCluskey L. (2015) Phishing Dataset. [Online].
http://eprints.hud.ac.uk/24330/
21. Mohammad R., Thabtah F., McCluskey L., (2013B) Neural Network based Algorithm for Web
Security. ICAI'13 - The 2013 International Conference on Artificial Intelligence. pp. (581-587). USA.
22. Mohammad R., Thabtah F., McCluskey L., (2012) An Assessment of Features Related to Phishing
Websites using an Automated Technique. Proceedings of the 7th IEEE International Conference
for Internet Technology and Secured Transactions (ICITST-2012). pp. (492-497), UK.
23. Qabajeh I., Thabtah F., Chiclana F. (2015) Dynamic Classification Rules Data Mining Method.
Journal of Management Analytics. Volume 2, Issue 3, pp. 233-253. Wiley.
24. Quinlan, J. (1993) C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.
25. Quinlan, J. (1979) Discovering rules from large collections of examples: a case study. In Expert
Systems in the Micro-electronic Age. Edinburgh, 1979.
26. Sumner M., Frank E., Hall M. (2005) Speeding up logistic model tree induction Knowl. Discov.
Databases: Pkdd, 2005 (3721) (2005), pp. 675–683.
27. Thabtah F., Abdelhamid N. (2016) Deriving Correlated Sets of Website Features for Phishing
Detection: A Computational Intelligence Approach. Journal of Information & Knowledge
Management, 1650042.
28. Thabtah F., Qabajeh I., Chiclana F. (2016A) Constrained dynamic rule induction learning. Expert
Systems with Applications 63, 74-85.
29. Thabtah F., Mohammad RM., McCluskey L. (2016B) A dynamic self-structuring neural network
model to combat phishing. Neural Networks (IJCNN), 2016 International Joint Conference, 4221-
4226. Canada.
30. Thabtah F., Hammoud S., Abdeljaber H. (2015) Parallel and Distributed Single and Multi-label
Associative Classification Data Mining Frameworks based MapReduce. Parallel Processing Letters
25, 1550002. World Scientific.
31. Thabtah F., Hammoud S. (2013) MR-ARM: A MapReduce Association Rule Mining. Parallel
Processing Letters, 23 (3) 1-22, 1350012. World Scientific.
32. Thabtah F., Gharaibeh O., Al-zubaidy R. (2012) Arabic Text Mining for Rule based Classification.
Journal of Information and Knowledge Management (JIKM). Volume: 11, Issue: 1(2012) pp. 1-10.
WorldScinet.
33. Thabtah F., Gharaibeh O., Abdel-jaber H. (2011) Comparison of rule based classification techniques
for the Arabic text. Proceedings of the IEEE ISIICT conference, pp. 75-83. Amman, Jordan.
34. Thabtah F. (2007) A review of associative classification mining. The Knowledge Engineering
Review 22 (01), 37-65.
35. Witten I. H., Frank E., Hall M. A., Pal C. J. (2016) Data mining: practical machine learning tools and
techniques with Java implementations, Morgan Kaufmann, 2016.