
Phishing Detection: A Case Analysis on Classifiers with Rules using Machine

Learning

Abstract

A typical predictive approach in data mining that produces If-Then knowledge for decision making is

rule based classification. Rule based classification includes a large number of algorithms that fall under

the categories of covering, greedy, rule induction, and associative classification. These approaches have

shown promising results due to the simplicity of the models generated and the user’s ability to

understand, and maintain them. Phishing is one of the emergent online threats in web security domains

that necessitates anti-phishing models with rules so users can easily differentiate among website types.

This article critically analyses recent research studies on the use of predictive models with rules for

phishing detection, and evaluates the applicability of these approaches on phishing. To accomplish our

task, we experimentally evaluate four different rule based classifiers that belong to greedy, associative

classification and rule induction approaches on real phishing datasets and with respect to different

evaluation measures. Moreover, we assess the classifiers derived and contrast them with known classic

classification algorithms including Bayes Net and Simple Logistic. The aim of the comparison is to

determine the pros and cons of predictive models with rules and reveal their actual performance when

it comes to detecting phishing activities. The results clearly showed that eDRI, a recent greedy

algorithm, not only generates useful models but that these are also highly competitive with respect to

predictive accuracy as well as runtime when they are employed as anti-phishing tools.

Keywords- Classification, Data Mining, Machine Learning, Phishing, Rules, Rule based Classifiers,

Website Security

1. Introduction

Phishing normally involves creating a well-designed fake website that closely resembles an

existing trusted business website, with the aim of tricking users and illegally obtaining their credentials,

such as login information (Abdelhamid, 2015). Phishers intend to use these credentials to access

financial information including bank account numbers, credit card information, etc. (Afroz and

Greenstadt, 2011). Unfortunately, the consequences of phishing are severe for users because they become

vulnerable to identity theft and information breaches (Nguyen, et al., 2015). Phishing usually occurs

through an email that appears to come from a trusted source and urges online users to adjust their login

information by clicking a hyperlink (Khadi and Shinde, 2014).

Since phishing is a typical classification problem, machine learning (ML) techniques seem

appropriate for deriving knowledge from websites’ features that can help in minimising this problem

(Thabtah and Abdelhamid, 2016) (Mohammad, et al, 2012) (Abdelhamid, et al., 2012). The key to success

in developing automated anti-phishing predictive systems is the website’s features. A very large

number of features are linked with a website, so a necessary step that can enhance the

predictive system’s performance is to pre-process the features in order to select the most effective

ones (Thabtah et al., 2016B). Feature effectiveness can be measured using different computational

intelligence methods such as information gain (Quinlan, 1979), correlation analysis (Hall, 1999),

chi-square (Liu and Setiono, 1995), and others.

Fadi Thabtah

Nelson Marlborough Institute of Technology,

Auckland, New Zealand

Firuz Kamalov

Canadian University of Dubai

Dubai, UAE

Once an initial feature set is chosen, the classification algorithm can be applied to the selected

features to come up with the predictive system (Abdelhamid, et al., 2013) (Thabtah et al., 2012). There

are many ML algorithms for classification developed by scholars in the last two to three decades

(Witten, et al., 2016). Most of these algorithms use one of the following major classification approaches

in deriving the predictive systems:

1) Decision trees (ID3) (Quinlan, 1979) and its successors

2) Probabilistic models

3) Rule based classification

a. Associative classification and Class Association Rule (Thabtah and Hammoud, 2013)

b. Rule induction

c. Covering or greedy classification (Thabtah, et al., 2011)

d. Decision rules (C4.5-Rules) (Quinlan, 1993)

e. Fuzzy Logic (FL) rules

4) Neural Networks (NN)

5) Support Vector Machine (SVM)

6) Others (Bagging, Self-Organization Map, Instance based learning, etc.)

Among the aforementioned classification approaches, this paper focuses on rule based

classification systems since we believe that these techniques are more suitable as anti-phishing tools.

Our preference for rule based classification is based on the following reasons:

1) The content of the models derived is straightforward human knowledge that novice users can

easily understand and apply when necessary.

2) Often the knowledge is formed as “If-Then” rules in which the antecedent of the rule (If part)

consists of a conjunction of feature values and the consequent (Then part) contains the target

attribute value (website type). These rules are easy for users to control.

3) Rule based classification techniques have proven merits in accurately predicting target values

in many domains including medical diagnoses, stock market analysis, email classification, text

categorisation and others.
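
The If-Then form described in point 2 can be illustrated with a minimal Python sketch. The `Rule` class, the `classify` helper, the feature names, the example rule and the default class are all hypothetical and chosen only for illustration; the encoding (1 = legitimate, -1 = phishy) follows the dataset described later in the paper.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    antecedent: dict  # If part: feature name -> required value (a conjunction)
    consequent: int   # Then part: class label (1 = legitimate, -1 = phishy)

    def matches(self, website: dict) -> bool:
        # The rule fires only when every feature=value test holds.
        return all(website.get(f) == v for f, v in self.antecedent.items())


def classify(rules, website, default=1):
    """Return the class of the first matching rule, else a default class."""
    for rule in rules:
        if rule.matches(website):
            return rule.consequent
    return default  # assumption: unmatched websites are treated as legitimate


# Hypothetical rule: If IP_Address = -1 and HTTPS = -1 Then class = -1 (phishy)
rules = [Rule({"IP_Address": -1, "HTTPS": -1}, -1)]
print(classify(rules, {"IP_Address": -1, "HTTPS": -1, "URL_Length": 1}))
```

Because each rule is a plain conjunction of feature values, a user can read, remove or edit a single rule without retraining the rest of the model, which is the controllability argument made above.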

This paper critically analyses recent rule based classification studies related to the website phishing

problem. We show how these approaches derive the predictive anti-phishing systems and their pros

and cons. Specifically, a comprehensive section that contrasts common rule based ML techniques is

included to highlight why this approach may serve as an intelligent anti-phishing tool. We believe

that limited attention has been paid to building anti-phishing models based on rules in ML and web security.

There are research attempts in associative classification such as (Abdelhamid, 2015) and (Abdelhamid

and Thabtah, 2014). However, covering and rule induction approaches are rarely applied to the website

phishing problem, e.g. (Mohammad, et al., 2013A) and (Thabtah, et al., 2016A). Therefore, one of the

aims of this paper is to reveal the efficiency and effectiveness of approaches that generate models with rules

in combating phishing.

In the experimental analysis, a real phishing dataset consisting of over eleven thousand websites and

thirty features is utilised (Mohammad, et al., 2015). The websites have been collected using an online

scripting tool and the analysis primarily focuses on detection rate, the number of rules induced and

the time taken to construct the predictive models. Moreover, we compare the performance of classifiers

that produce rules with other common classic classification algorithms, namely the probabilistic Bayes Net and

Simple Logistic. Lastly, the experiments also show the role of feature selection and its impact on the

phishing detection rate and the dimensionality of the dataset. The paper attempts to answer whether

classifiers with rules can deal effectively with the phishing classification problem and, if so, which learning

approach potentially serves as an anti-phishing tool and why.

This paper is structured as follows: Common predictive models that have been applied recently to

the phishing problem are critically analysed in Section 2 along with the phishing problem and the

phishing phases. Section 3 is devoted to experiments and results analysis on a real dataset using

six classifiers, four of which are rule based and two of which are classic algorithms.

Finally, the conclusions and recommendations for further research are given in Section 4.

2. Literature Review: Common Rule based Classifiers for Phishing

2.1 Phishing Problem and Its Life cycle

The website phishing problem usually contains a target attribute (class) with two values, since each website

can be either phishy or legitimate. Therefore, this problem can be considered a predictive classification task in

ML. In classification, a predictive model is constructed from labelled historical data in order to forecast a

target attribute in an unseen dataset. Similarly, in phishing, a number of website features can be obtained

from different sources such as Phishtank and then pre-processed. Once the data is pre-processed, an intelligent

algorithm based on data mining or ML can be applied to it in order to derive the anti-phishing

model (classifier). This classifier is then integrated into a web browser and utilised to predict the type of any

website browsed by the user. The browsed website is basically the test data example. In most related

studies, the phishing problem was treated as a binary classification in which the class has only two values, i.e.

phishy or legitimate. However, recent research studies such as (Abdelhamid, et al., 2014) dealt with

phishing as a multi-class problem by adding a new class, i.e. suspicious.

Phishing incidents often occur through emails sent to online users urging them to update their

information (Mohammad et al., 2013B). There are other ways to initiate a phishing attack including

online blogs, Instant Messages (IMs), peer-to-peer file sharing services, online forums, social media

websites among others (Abdelhamid, 2015). Figure 1 depicts the process of starting a phishing attack

and the phases involved:

1) A hyperlink in an email or IM is sent along with a text message to the user urging them to

update their account information.

2) When the user clicks on the hyperlink, they are redirected to a fake website that replicates an

authentic website the user normally encounters.

3) When the user tries to log in, they become vulnerable to phishing and their credentials are

sent to a key logger or a server.

4) Once the phishers obtain the user’s credentials, that user becomes subject to online fraud.

Figure 1. Phishing steps (Abdelhamid et al., 2014)

[Flow diagram. Phisher: initialises attack; Phisher: creates phishy email; Phisher: sends phishy email; User: receives phishy email; User: discloses information; Phisher: defrauding successful. The steps span the Early, Mid and Post Phishing Phases; the anti-phishing tool detects and prevents the attack.]

2.2 Decision Tree Rules

The decision tree is a classification approach that became known after the publication of the ID3 algorithm

(Quinlan, 1979). A decision tree can be constructed based on the available variables inside the training

dataset using Information Gain (IG) (Equation 1). The tree is built by initially selecting the variable

with the highest computed IG among all available variables in the training dataset as the root node. The

IG is computed using the Entropy (Equation 2), which denotes how informative a variable is in splitting

the training instances according to the target variable (class). The more purely a variable splits the training

instances with reference to the class values, the higher the score. After choosing the root node, the

algorithm repeatedly computes the IG for the remaining variables excluding the root until the tree

cannot be divided any further or all training instances for a variable belong to one class. A tree

constructed by ID3 can be transformed into a rule set in which each path in that tree linking the root node

to a leaf makes a rule.

$$\mathrm{Gain}(T, F) = \mathrm{Entropy}(T) - \sum_{f} \frac{|T_f|}{|T|}\, \mathrm{Entropy}(T_f) \tag{1}$$

where

$$\mathrm{Entropy}(T) = -\sum_{c} P_c \log_2 P_c \tag{2}$$

where

$P_c$ = probability that an instance of T belongs to class c,

$T_f$ = subset of T for which feature F has value f,

$|T_f|$ = number of examples in $T_f$, and $|T|$ = size of T.
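
As a worked illustration of Equations 1 and 2, the following Python sketch computes entropy and information gain on a tiny toy dataset. The function names, the feature names `f` and `g`, and the data are hypothetical and are not taken from the phishing dataset; this is only a sketch of the calculation, not of the ID3 implementation used in the experiments.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Equation 2: -sum over classes of P_c * log2(P_c)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Equation 1: Entropy(T) minus the weighted entropy of each subset T_f."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(label)
    total = len(labels)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Toy data: feature "f" separates the classes perfectly, feature "g" not at all.
rows = [{"f": "a", "g": "x"}, {"f": "a", "g": "y"},
        {"f": "b", "g": "x"}, {"f": "b", "g": "y"}]
labels = [1, 1, -1, -1]
print(information_gain(rows, labels, "f"), information_gain(rows, labels, "g"))
```

In this toy case the perfectly splitting feature would be chosen as the root node because its gain equals the full entropy of the class distribution, while the uninformative feature has a gain of zero.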

(Fette, et al., 2007) investigated phishing utilising C4.5-Rules algorithm (Quinlan, 1993) and then

compared the results obtained with classic ML techniques including Random Forest, SVM and Naïve

Bayes. Experiments on a set of 860 phishy and 695 ham emails were conducted. Various features for

distinguishing phishing emails have been identified, e.g. IP URLs, time of space, HTML messages,

number of connections inside the email, “JavaScript” and others. The authors reasoned that ML

methods can be improved with respect to predictive performance when grouping identified variables

inside the resulting classifiers.

2.3 Rule Induction

Covering methods such as Dynamic Rule Induction (DRI) (Qabajeh, et al., 2015), Enhanced Dynamic

Rule Induction (eDRI) (Thabtah, et al., 2016A) and PRISM (Cendrowska, 1987) derive the rules one by

one from the training dataset. Normally these algorithms divide the training dataset into subsets in

which each subset represents a target attribute (class). The instances in one subset are considered

positive to the subset’s class and negative for all other class labels. The covering algorithm is then

applied on each subset in which an empty rule is created, i.e. “If Empty then class1”. The algorithm

keeps adding attribute values to the current rule, normally using a statistical measure such as the rule’s

expected accuracy (Equation 3). Building the current rule stops according to a certain condition. For

instance, the PRISM algorithm stops building the rule when the rule’s expected accuracy cannot be further

improved, i.e. reaches 100%, whereas DRI stops building the rule when its frequency reaches a

predefined threshold called Rule_freq. Once a rule is constructed, all of the training data instances

covered by it are removed. The covering algorithm then starts building a new rule from the same

subset until that subset becomes empty, and then moves to the next class subset,

and so on. The covering algorithm usually terminates when the training

dataset has no more data instances or when no rules with acceptable accuracy can be found.

$$\mathrm{Expected\ accuracy}(r) = \frac{P}{T} \tag{3}$$

where P = the number of positive instances covered by a rule r (both antecedent and consequent), and

T = the total number of instances covered by r’s antecedent.
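
The covering procedure and Equation 3 can be sketched as follows. This is a simplified, PRISM-style illustration under stated assumptions, not the implementation of PRISM, DRI or eDRI: all function names and the toy data are hypothetical, ties are broken by the number of positives covered, and a rule stops growing only when its expected accuracy reaches 100% or no features remain.

```python
def covers(antecedent, row):
    """True when the row satisfies every feature=value test in the antecedent."""
    return all(row[f] == v for f, v in antecedent.items())


def expected_accuracy(antecedent, rows, labels, target):
    """Return (P / T, P) as in Equation 3; (0.0, 0) when nothing is covered."""
    covered = [label for row, label in zip(rows, labels) if covers(antecedent, row)]
    positives = sum(1 for label in covered if label == target)
    return (positives / len(covered) if covered else 0.0), positives


def grow_rule(rows, labels, target):
    """Start from an empty rule and greedily add the best feature value."""
    features = sorted(rows[0])
    antecedent = {}
    while len(antecedent) < len(features):
        accuracy, _ = expected_accuracy(antecedent, rows, labels, target)
        if accuracy == 1.0:  # PRISM-style stop: accuracy cannot be improved
            break
        candidates = sorted({(f, row[f]) for row in rows for f in features
                             if f not in antecedent})
        best = max(candidates, key=lambda fv: expected_accuracy(
            {**antecedent, fv[0]: fv[1]}, rows, labels, target))
        antecedent[best[0]] = best[1]
    return antecedent


def covering(rows, labels, target):
    """Derive rules for one class, removing covered instances after each rule."""
    rows, labels = list(rows), list(labels)
    rules = []
    while target in labels:
        antecedent = grow_rule(rows, labels, target)
        rules.append(antecedent)
        kept = [(r, l) for r, l in zip(rows, labels) if not covers(antecedent, r)]
        rows = [r for r, _ in kept]
        labels = [l for _, l in kept]
    return rules


# Toy data with two hypothetical features; -1 = phishy, 1 = legitimate.
toy_rows = [{"https": -1, "ip": -1}, {"https": -1, "ip": 1},
            {"https": 1, "ip": -1}, {"https": 1, "ip": 1}]
toy_labels = [-1, -1, -1, 1]
print(covering(toy_rows, toy_labels, target=-1))
```

A frequency-based stopping condition such as Rule_freq could be added by also checking the number of covered positives inside `grow_rule`; that variation is left out to keep the sketch short.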

One of the latest covering algorithms that has been applied as an anti-phishing tool is eDRI

(Thabtah, et al., 2016A). eDRI is considered a dynamic algorithm that mines datasets using two main

thresholds named frequency and Rule_Strength. This algorithm scans the training dataset and only

stores “strong” feature values, i.e. feature values with frequencies larger than the minimum frequency threshold.

These strong feature values are the only ones that can be part of a rule and all others are removed

during the first data scan. Once a rule is discovered, eDRI not only removes its training instances but

also updates the frequencies of the strong feature values to reflect that deletion, so the algorithm performs a

natural pruning that often leads to controllable predictive systems. eDRI was applied on a binary real

phishing dataset that contains websites collected from different sources. The results obtained from

eDRI were highly competitive with respect to the phishing detection rate. This covering algorithm was

able to outperform other types of classification systems generated by decision trees and associative

classification. Moreover, the classification systems generated by eDRI have shown interesting

knowledge helpful for users and the security experts to design an information security strategy.

Rule induction is a more sophisticated classification approach than covering that adds extensive

pruning to further reduce the size of the resulting classification systems (Abdelhamid and Thabtah,

2014) (Thabtah et al., 2011). One of the common rule induction approaches is RIPPER (Cohen, 1995),

which is similar to PRISM in the way it splits the input dataset. However, RIPPER invokes a rule

growing phase. In this phase, the algorithm starts with data instances that belong to the

minority class (the least frequent class in the training dataset) and constructs a rule by appending feature

values to the rule’s body until the rule error becomes as small as possible, i.e. zero. While the rule is

being constructed, RIPPER employs a special pruning procedure to eliminate unnecessary feature values

that have no significance when appended to the current rule’s body. RIPPER terminates the rule

building process when the rule’s error reaches 50% or using the minimum description length (MDL)

principle (Hall et al., 2009). The rule gets derived when adding a feature value to its body generates an error

larger than the error obtained before adding that feature value.

(Khadi and Shinde, 2014) studied the problem of email phishing and proposed a potential solution

based on combining the RIPPER classifier with fuzzy logic (FL). The role of FL is to pick the main

features of the email and rank them based on a probability score. On the other hand, the role of RIPPER

is to automatically use these features to classify the type of emails into ham or phishy. Two components

of the email have been used, i.e. email message (spelling errors, embedded link) and URL (IP address,

Length, Long URL, Suffix_Prefix, Crawler URL, Non matching URL). Moreover, very limited data

consisting of 100 instances from Phishtank have been used in the experiments in the WEKA software tool

(Hall et al., 2009). No comparison with other FL or rule based classification methods was conducted by the

authors. The results showed that twelve rules were generated by RIPPER from the dataset with an

85.4% prediction rate.

(Aburrous, et al., 2010) investigated data mining methods to seek their applicability in categorising

websites based on phishing features. The website’s features were manually classified into six criteria as

described in an earlier published work on phishing (Aburrous, et al., 2008). Then using WEKA, a

number of experiments with four classification algorithms have been conducted against a small dataset

of 1006 instances downloaded from Phishtank. The basis of the experiments is the classification accuracy

of the classifiers produced. The results revealed that the CBA algorithm (Liu et al., 1998) is a promising

classifier that was able to detect on average 83% of phishing websites. The authors suggested that

results obtained can be further enhanced when careful feature selection is employed.

2.4 Associative Classification

Associative classification is an approach that employs association rule discovery methods to generate

classifiers from datasets (Thabtah et al., 2015) (Mohammad, et al., 2015). This approach normally relies

on two thresholds: the user-defined minimum support and minimum confidence. The support of a feature

value is basically how many times that feature value has appeared in the training dataset, whereas the

confidence of that feature value is the frequency of the feature value together with the target class relative to its support.

Any feature value with a frequency larger than the minimum support becomes a frequent feature value. The

aim of the associative classification algorithm is to discover all frequent feature values and from those

the classification system is built, in which any frequent feature value with confidence larger than the

minimum confidence becomes a rule. One advantage of associative classification is the low error rate of the

classification systems derived, but a noticeable drawback is the very large number of rules derived,

which limits its use in real applications.
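
The two thresholds can be illustrated with a short Python sketch restricted to rules with a single feature value in the antecedent. It assumes the usual definition of confidence as the joint frequency of the feature value and the class divided by the frequency of the feature value. The function name, the threshold values, the toy data and the use of "greater than or equal" at the threshold boundary are assumptions; real associative classifiers such as CBA and MCAR also combine several feature values and apply rule ranking and pruning, which are omitted here.

```python
from collections import Counter


def single_feature_rules(rows, labels, min_support, min_confidence):
    """Return (feature, value, class, confidence) for frequent, confident items."""
    value_counts = Counter()   # support: occurrences of each feature value
    joint_counts = Counter()   # occurrences of feature value together with class
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[(feature, value)] += 1
            joint_counts[(feature, value, label)] += 1

    rules = []
    for (feature, value, label), joint in sorted(joint_counts.items()):
        support = value_counts[(feature, value)]
        confidence = joint / support
        # Boundary handling (>=) is an assumption of this sketch.
        if support >= min_support and confidence >= min_confidence:
            rules.append((feature, value, label, confidence))
    return rules


# Toy data with one hypothetical feature; -1 = phishy, 1 = legitimate.
ac_rows = [{"https": -1}, {"https": -1}, {"https": -1}, {"https": 1}]
ac_labels = [-1, -1, 1, 1]
print(single_feature_rules(ac_rows, ac_labels, min_support=2, min_confidence=0.6))
```

Lowering either threshold in this sketch quickly increases the number of rules returned, which mirrors the drawback noted above.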

Two associative classification methods named CBA and MCAR have been evaluated on a Phishtank

dataset to seek their applicability in detecting phishing (Aburrous et al., 2010). The authors used a

dataset consisting of over 1000 instances with 27 different features and applied CBA and four other

classifiers using the WEKA tool. The aim was to assist security managers within organisations in

building an intelligent anti-phishing tool within browsers that can detect phishing as accurately as

possible. Experimental results of the six ML algorithms revealed that associative classification methods

generate more rules than the rest of the algorithms yet produce classifiers with higher predictive accuracy. More specifically,

the associative classification systems produced showed high correlations among features linked with

three major criteria (URL, Domain Identity, Encryption). Nevertheless, the massive number of rules

derived by CBA may overwhelm end-users since they will not be able to control the anti-phishing

system. Further, the authors have not implemented the associative classification rules within a browser

to evaluate its real performance and thus it will be hard to measure the success or failure of their

classification systems.

Recently, more domain specific associative classification anti-phishing systems were created by

(Abdelhamid, et al., 2014) (Abdelhamid, 2015). The new models take into account not only the two class

values of the phishing problem (legitimate, phishy) but also a harder case to detect, namely the

“suspicious” class label. Instances that are neither fully phishy nor fully legitimate are very hard to detect by

typical ML algorithms and thus increase their false positive rates. Hence these authors enhanced

current intelligent classification systems by including two distinct advantages:

1) Extending the phishing problem to include suspicious cases which makes it more realistic

2) Proposing a new multi-label learning phase that can discover disjunctive rules besides the

conjunctive rules. These additional disjunctive rules are tossed out by existing associative

classification methods. The new multi-label phase enhances predictive power and provides

more useful knowledge to the end-user.

The authors used a dataset that has sixteen features and over 2000 instances and compared the

performance of their classifiers with other traditional classifiers with respect to knowledge derived and

accuracy. Before experimentation, the authors employed a chi-square testing method to measure

the goodness of the features and to discriminate among features with respect to their impact on phishing.

The results showed a highly competitive performance of the new multi-label associative

classifiers when compared with other existing algorithms such as CBA and decision trees.

3. Experimental Results

3.1 Experimental Settings and Data

The experiments have been performed using a recently published security dataset that belongs to the

author. The dataset has been published at the University of California, Irvine data repository (Litchman, 2013) and

consists of thirty website features and the target variable (class). The security dataset contains

more than 11000 examples in which each example represents a website that can be either legitimate or

phishy, and the majority of the existing features are either binary or multi-valued, i.e. ternary.

The website features in the dataset were extracted using a PHP script that was embedded in

the web browser and applied on Phishtank and legitimate websites between the middle of 2013 and

late 2015. The data instances have been collected from the Yahoo directory and Phishtank repositories.

Before collecting the websites, and for each feature, a hand crafted rule was designed and then coded

in PHP. For example, for the IP Address, a rule was designed that maps a website to the phishy class when

an IP address appears in the URL. Another rule examines the URL length and assigns a website phishy

status when the URL length exceeds 75 characters. Further details on the complete description of features

and their hand crafted rules can be found in (Mohammad, et al., 2015).
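
The two hand crafted rules mentioned above can be sketched as follows. The original extraction was coded in PHP; this Python version is only an illustration, and the function names, the dotted-decimal IPv4 check and the example URLs are assumptions. Only the phishy and legitimate outcomes stated in the text are modelled; any intermediate "suspicious" value used in the dataset is not reproduced here.

```python
import re
from urllib.parse import urlparse

IPV4_PATTERN = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def ip_address_feature(url):
    """Return -1 (phishy) when the host part of the URL is an IPv4 address, else 1."""
    host = urlparse(url).hostname or ""
    return -1 if IPV4_PATTERN.fullmatch(host) else 1


def url_length_feature(url, threshold=75):
    """Return -1 (phishy) when the URL is longer than the threshold, else 1."""
    return -1 if len(url) > threshold else 1


# Illustrative URLs only (192.0.2.1 is a documentation address).
print(ip_address_feature("http://192.0.2.1/login.html"))
print(url_length_feature("http://example.com/"))
```

This sketch checks only plain dotted-decimal addresses; obfuscated forms such as hexadecimal host names would need additional handling.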

Table 1 shows twelve sample websites using only eight features of the dataset for presentation

purposes. For instance, in the first column (IP Address) a website is given 1 for this feature if there was

no IP Address appearing in the URL otherwise a -1 is assigned. The possible values assigned to the

features have been adopted from handcrafted rules developed in (Mohammad, et al., 2015). The

possible values (-1, 1, 0) correspond to phishy, legitimate and suspicious values. In Table 1, three

features are linked with ternary values, i.e. (URL_Length, URL_of_Anchor, Links in Tags) and the rest

are assigned binary values (1, -1). The last column in the table represents the class. Of the

twelve sample websites, four are identified as legitimate and eight as phishy. The class values

have been assigned based on the possible feature values per website and using the hand crafted

feature rules.

We have used the WEKA (Hall et al., 2009) ML tool to run the experiments. WEKA is short for

Waikato Environment for Knowledge Analysis and it is a free ML Java platform that contains various

different implementations of knowledge discovery algorithms, filtering methods and visualisation

techniques. To ensure the validity of the predictive models’ results, ten-fold cross-validation was utilised

during the training phase of all classification algorithms considered in the experiments. The computing

machine used during the experiments has a 2.5 GHz processor.
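
For readers unfamiliar with ten-fold cross-validation, the following Python sketch shows the idea independently of WEKA. The function names and the `train_fn` / `predict_fn` callables are hypothetical, the folds are produced by a simple shuffled split without stratification by class, and the toy majority-class learner is included only to make the example runnable.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def cross_validated_error(rows, labels, train_fn, predict_fn, k=10):
    """Average error rate when each instance is tested exactly once."""
    errors = 0
    for train, test in k_fold_indices(len(rows), k):
        model = train_fn([rows[i] for i in train], [labels[i] for i in train])
        errors += sum(1 for i in test if predict_fn(model, rows[i]) != labels[i])
    return errors / len(rows)


# Toy learner: always predict the majority class seen during training.
def train_majority(train_rows, train_labels):
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, row):
    return model


print(cross_validated_error(list(range(20)), [1] * 20, train_majority, predict_majority))
```

Each model is evaluated only on instances it did not see during training, so the reported error rate is less optimistic than an error rate measured on the training data itself.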

Table 1. Sample of twelve websites from the dataset for eight features (Class: 1 = Legitimate, -1 = Phishy)

IP address  URL Length  Shortening Service  At symbol  HTTPS  Request URL  URL of Anchor  Links in Tags  Class
-1          1           1                   1          -1     1            -1             1              -1
1           1           1                   1          -1     1            0              -1             -1
1           0           1                   1          -1     1            0              -1             -1
1           0           1                   1          -1     -1           0              0              -1
1           0           -1                  1          1      1            0              0              1
-1          0           -1                  1          -1     1            0              0              1
1           0           -1                  1          1      -1           -1             0              -1
1           0           1                   1          -1     -1           0              -1             -1
1           0           -1                  1          -1     1            0              1              1
1           1           -1                  1          1      1            0              1              -1
1           1           1                   1          1      -1           0              0              1
1           1           -1                  1          1      1            -1             -1             -1

A number of rule based classification algorithms have been chosen to measure the effectiveness of

this family of algorithms in detecting phishing. In particular, eDRI (Thabtah, et al., 2016A), RIPPER

(Cohen, 1995), C4.5-Rules (Quinlan, 1993), and RIDOR (Gaines and Compton, 1995) have been chosen.

We evaluate the predictive power as well as the predictive model content on the problem

of website phishing classification. We also compared rule based classification with known classic

non-rule-based ML algorithms. To be exact, we used probabilistic Bayes Net (Bouckaert, 2004), and Simple

Logistic (Sumner, et al., 2005). The bases of the comparison are:

1) Predictive power measured using the one-error rate

2) Runtime measured using the time taken to construct the predictive models

3) Predictive model content for decision makers, at least for rule based classifiers

Finally, two major experiments have been conducted: one that does not consider feature filtering and

one with feature filtering. We decided to use the Correlation-based Feature Selection (CFS) filtering method (Hall,

1999) because it reduces feature-to-feature correlation and maximises feature-to-class correlation.

As a matter of fact, CFS usually reduces the dimensionality of the input dataset significantly when

compared with other known filtering and wrapping methods in the literature such as Information Gain

and Chi Square, without drastically hindering the predictive model’s performance.
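
The trade-off that CFS makes between feature-to-class and feature-to-feature correlation can be illustrated with the subset merit heuristic from (Hall, 1999), which scores a subset of k features as k times the average feature-to-class correlation divided by the square root of k + k(k - 1) times the average feature-to-feature correlation. The Python sketch below only evaluates this score for given averages; the function and parameter names are hypothetical, and how the correlations are estimated and how candidate subsets are searched is not shown.

```python
from math import sqrt


def cfs_merit(k, avg_feature_class_corr, avg_feature_feature_corr):
    """Merit of a k-feature subset: high class correlation, low redundancy."""
    numerator = k * avg_feature_class_corr
    denominator = sqrt(k + k * (k - 1) * avg_feature_feature_corr)
    return numerator / denominator


# Four features with the same class correlation: uncorrelated vs fully redundant.
print(cfs_merit(4, 0.5, 0.0))
print(cfs_merit(4, 0.5, 1.0))
```

With fully redundant features the merit of the four-feature subset falls back to that of a single feature, which is why redundant features tend to be discarded by this kind of filter.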

3.2 Results Analysis

Figures 2a & 2b show two important evaluation measures: the error rate generated by the rule based

predictive models and the time taken to build them in milliseconds (ms). It is obvious from Figure 2a

that C4.5-Rules models have outperformed the remaining classifiers. In particular, C4.5-Rules achieved

0.86%, 3.03%, and 3.33% higher accuracy than the RIPPER, RIDOR and eDRI algorithms

respectively. Nevertheless, C4.5-Rules models are very large with respect to the number of rules

derived when compared with the remaining rule based classifiers, which may limit their use in real

domain applications including phishing (see Figure 3). As a matter of fact, C4.5-Rules discovered far

more rules than its counterparts. To be exact, there were 140, 130 and 144 additional rules generated by

C4.5-Rules when compared with RIPPER, eDRI and RIDOR respectively. This additional knowledge

may contribute to increasing predictive performance but resulted in uncontrollable classifiers

that may overwhelm end-users. Overall, classification algorithms that derive rule based classifiers

were able to produce predictive models with good accuracy and maintainable sets of rules, with the exception of

C4.5-Rules.

The fastest algorithm in building predictive models among the rule based classifiers was eDRI. This

algorithm has a smart rule induction strategy that guarantees removing data instances connected with

each generated rule on the fly and then amends the positions of the remaining potential rules.

Amending the rank of the rules based on their frequencies whenever a rule is generated allows for

differentiating weak rules before the rule evaluation phase kicks in, which consequently minimises the

use of computing resources. eDRI derives models quicker than most of the other rule based classifiers,

as is obvious from Figure 2b, which makes this algorithm suitable as an anti-phishing method.

Another reason for the suitability of eDRI is the fact that it has produced the least number

of rules (see Figure 3) yet maintained an acceptable predictive performance. A slight decrease

in accuracy in exchange for a maintainable set of rules can be tolerated.

Figure 2a. Error rate of the rule based predictive models of the considered algorithms (C4.5-Rules, RIPPER, eDRI, Ridor)

Figure 2b. Run time in ms required to build the rule based predictive models of the considered algorithms

To seek further advantages and disadvantages of rule based classifiers for phishing detection, we

have compared the four algorithms with common classic predictive classifiers such as probabilistic

Bayes Net (Bouckaert, 2004), and Simple Logistic (Sumner, et al., 2005). Since the latter two classifiers

do not produce rules, the bases of the comparison were the error rate and the time taken to construct the

predictive models in ms.

Figures 4a & 4b depict the error rate and the runtime in ms for the six classification algorithms. The

results clearly indicate that rule based classifiers are very competitive in predicting phishing websites

since RIPPER and C4.5-Rules were the top two classifiers with regard to predictive accuracy. As a

matter of fact, when calculating the average error rate of the four rule based classifiers (C4.5-Rules,

eDRI, RIPPER, RIDOR) and comparing it with each classic classifier’s error rate, the rule based classifiers

still dominate. The average error rate of the rule based classifiers is 5.93% and this is lower than those

of Bayes Net and Simple Logistic by 1.08% and 0.47% respectively. This is a direct, if limited, indication

that classifiers with rules are not only beneficial for decision makers but normally achieve a higher

predictive performance than traditional classification algorithms such as Bayes Net and Simple

Logistic, at least for phishing detection of websites.

Figure 4b illustrates the runtime in ms taken to build the predictive models by the six classifiers. The

figure indicates that not only are rule based classifiers more predictive than other classification

algorithms but they are also very competitive with respect to the time taken to build the classifiers. The

fastest algorithms are eDRI and Bayes Net. Bayes Net utilises simple calculations of the likelihood of

each class in the dataset based on probability theory and hence there is no pruning or rule discovery

involved. Hence, it is fast when deriving predictive models, which explains the minimal time taken to

build the anti-phishing models in Figure 4b.

Figure 3: The number of rules generated by the rule based predictive models.

We have looked into the features that have a high influence on the target class in order to seek a more concise rule based model for phishing. To achieve this, we employed CFS filtering (Hall, 1999) to remove redundant features in the early stages before building the models. The most effective phishing features that CFS chose are Prefix_Suffix, having_Sub_Domain, SSLfinal_State, Request_URL, URL_of_Anchor, Links_in_tags, SFH, web_traffic and Google_Index. All other features were discarded before we ran the classifiers against the aforementioned features. Figure 5 demonstrates that despite the dimensionality reduction of the dataset, in which 21 features were removed, the predictive performance is sustained and dropped only slightly, by less than 1%. Moreover, rule based classifiers maintained good predictive performance when compared with classic classifiers after pre-processing the phishing dataset. The average error of the rule based classifiers (C4.5-Rules, eDRI, RIPPER, RIDOR) was slightly lower than that of Bayes Net and Simple Logistics. Overall, feature selection has significantly cut down the number of unnecessary features without drastically increasing the error rate, as is clear from Figure 5.
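To make the idea behind CFS filtering more concrete, the sketch below computes the subset merit heuristic from Hall (1999), which rewards features that correlate with the class and penalises features that correlate with each other. It is an illustration only: the function name `cfs_merit`, the helper `keep_selected` and the correlation values in the usage lines are assumptions, and only the nine feature names come from the text above.

```python
import math

# The nine features reported as selected by CFS in this section.
SELECTED_FEATURES = [
    "Prefix_Suffix", "having_Sub_Domain", "SSLfinal_State",
    "Request_URL", "URL_of_Anchor", "Links_in_tags",
    "SFH", "web_traffic", "Google_Index",
]


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """CFS merit of a subset of k features.

    merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff)
    where r_cf is the mean feature-class correlation and
    r_ff is the mean feature-feature correlation.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return k * mean_feature_class_corr / denominator


def keep_selected(record, selected=SELECTED_FEATURES):
    """Project one website record (a dict) onto the selected features."""
    return {name: record[name] for name in selected if name in record}


# Illustrative values only: the same class correlation, with low vs. high redundancy.
low_redundancy = cfs_merit(9, 0.4, 0.1)
high_redundancy = cfs_merit(9, 0.4, 0.8)
print(low_redundancy > high_redundancy)
```

With the class correlation held fixed, the subset whose features are less correlated with one another receives the higher merit, which is why redundant features tend to be filtered out.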

Finally, we investigated the content of the classifiers generated by C4.5-Rules, RIPPER and eDRI to seek common useful knowledge for end-users. Based on the predictive models of these algorithms, it seems that the rule based algorithms induce rules in different ways, which is reflected in the content of their classifiers. For instance, RIPPER tends to focus on the phishing class (class = -1) during rule induction, and hence it generated rules only for that class label; when none of these rules can be applied, RIPPER assigns class = 1, i.e. “legitimate”, to the test data. On the other hand, the RIDOR algorithm discovers rules that belong only to legitimate websites, and when none of these rules can be applied to a test data item it assigns class = -1. Thus, we think that the rules in both the RIPPER and RIDOR classifiers are imbalanced with respect to class label, and therefore the knowledge derived by both algorithms on the phishing problem is inadequate. In contrast, the eDRI algorithm derived anti-phishing classifiers that contain nine rules with the phishing class and sixteen rules with the legitimate class. Hence, both class labels are represented to a certain degree within the eDRI classifier, and the user can distinguish between the classes based on the content of the classifier. Moreover, the number of rules in eDRI is smaller than in RIDOR and RIPPER, and thus maintaining eDRI rules is easier for the end-user. Overall, we found that the rules below are substantial for the end-user in distinguishing phishing websites:

If URL_of_Anchor = -1 and SSL_Final_state = -1 Then website = -1

If URL_of_Anchor = -1 and SSL_Final_state = 0 Then website = -1

If Prefix_Suffix = 1 Then website = 1
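As an illustration of how an end-user tool might apply the three rules listed above, the following sketch encodes them as an ordered list and returns the class of the first rule whose conditions all hold. The first-match ordering, the `default` argument and the function name `classify` are assumptions made for this sketch; the feature spellings follow the rules as printed, and class -1 denotes phishing while class 1 denotes legitimate.

```python
# Each rule is (conditions, predicted class); -1 = phishing, 1 = legitimate.
RULES = [
    ({"URL_of_Anchor": -1, "SSL_Final_state": -1}, -1),
    ({"URL_of_Anchor": -1, "SSL_Final_state": 0}, -1),
    ({"Prefix_Suffix": 1}, 1),
]


def classify(site, rules=RULES, default=None):
    """Return the class of the first rule that fully matches the site.

    `site` maps feature names to values. If no rule fires, `default`
    is returned (None here; a RIPPER-style list would use a default class).
    """
    for conditions, label in rules:
        if all(site.get(name) == value for name, value in conditions.items()):
            return label
    return default


suspicious = {"URL_of_Anchor": -1, "SSL_Final_state": 0, "Prefix_Suffix": 1}
benign = {"URL_of_Anchor": 1, "SSL_Final_state": 1, "Prefix_Suffix": 1}
print(classify(suspicious), classify(benign))
```

Passing `default=1` mimics the behaviour described for RIPPER, where a website is labelled legitimate when none of the phishing rules applies.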

Figure 4a: Error rate (%) of the predictive models built by the considered algorithms (C4.5-Rules, RIPPER, eDRI, Ridor, Bayes Net, SimpleLogistic).

Figure 4b: Runtime in ms required to build all predictive models of the considered algorithms (C4.5-Rules, RIPPER, eDRI, Ridor, Bayes Net, SimpleLogistic).

4. Conclusions

Phishing normally involves creating a fake website that closely resembles an existing trusted business website and aims to trick users and access their financial assets. One way to minimise the risks associated with phishing is to build automated predictive models using rule based classification techniques, since models that contain simple If-Then knowledge are favoured by end users in understanding and controlling this serious web threat. Hence, this paper critically analysed and then thoroughly experimented with a number of rule based classifiers and other non-rule based algorithms on the phishing detection problem. In particular, we have evaluated the rule based classification approach and its applicability to website phishing using a real dataset, which was recently published in the University of California, Irvine (UCI) data repository. To achieve this aim, four rule based classifiers that belong to different families of algorithms were used (C4.5-Rules, RIPPER, eDRI, RIDOR), along with two non-rule classic classification algorithms (the probabilistic Bayes Net and Simple Logistic). The bases of comparison were detection error rate, time taken to build the predictive models in ms, and the content of the models. We have also taken feature filtering into consideration and its effect on phishing detection rate.

The experimental results on a large set of phishing websites revealed that rule based classifiers are highly useful anti-phishing techniques, since they derived moderate size models without hindering predictive accuracy. In particular, the average error of the rule based classifiers, especially eDRI, was smaller than that of each classic classification algorithm. Moreover, eDRI was the fastest method to generate predictive models among the rule based classifiers, and this algorithm was highly competitive with Bayes Net. More importantly, when correlation based feature selection (CFS) filtering was employed, eDRI was barely impacted and the algorithm was able to produce anti-phishing models with less than a 1% increase in error rate by using just nine features instead of thirty. Overall, all rule based predictive models performed well with respect to detection rate and runtime, which indicates the suitability of these approaches for detecting phishing. In fact, the rules discovered by the eDRI and RIPPER algorithms are influential in differentiating between websites, since they can serve as decision tools for end-users to fight phishing.

In the near future we intend to design rule pruning methods to further reduce the number of rules

derived by rule based predictive models.

Figure 5: Error rate of the considered algorithms after the CFS filtering method was applied (C4.5-Rules, RIPPER, eDRI, Ridor, Bayes Net, Simple Logistic).

References

1. Abdelhamid N. (2015) Multi-label rules for phishing classification. Applied Computing and

Informatics 11 (1), 29-46.

2. Abdelhamid N., Thabtah F., (2014) Associative Classification Approaches: Review and

Comparison. Journal of Information and Knowledge Management (JIKM). Vol. 13, No. 3 (2014)

1450027.

3. Abdelhamid N., Thabtah F., Ayesh A. (2014) Phishing detection based associative classification

data mining. Expert systems with Applications Journal. 41 (2014) 5948–5959.

4. Abdelhamid N, Ayesh A., Thabtah F. (2013) Phishing Detection using Associative Classification

Data Mining. ICAI'13 - The 2013 International Conference on Artificial Intelligence, pp. (491-499).

USA.

5. Abdelhamid N., Ayesh A., Thabtah F. (2012) An Experimental Study of Three Different Rule

Ranking Formulas in Associative Classification Mining. Proceedings of the 7th IEEE International

Conference for Internet Technology and Secured Transactions (ICITST-2012), pp. (795-800), UK.

6. Aburrous M., Hossain M., Dahal K.P. and Thabtah F. (2010) Associative Classification techniques

for predicting e-banking phishing websites. Proceedings of the 2010 International Conference on

Information Technology, Las Vegas, Nevada, USA, 2010, pp. 176-181.

7. Aburrous M., Hossain A., Dahal K., Thabtah F. (2008) Intelligent Quality Performance Assessment

for E-Banking Security using Fuzzy Logic. Proceedings of the 7th IEEE International Conference on

Information Technology (ITNG 2008). Las Vegas, USA.

8. Afroz, S., & Greenstadt, R. (2011) PhishZoo: Detecting Phishing Websites by Looking at Them. In Fifth

International Conference on Semantic Computing (September 18- September 21). Palo Alto,

California USA, 2011. IEEE.

9. Bouckaert, R. R., (2004). Bayesian network classifiers in Weka. (Working paper series. University of

Waikato, Department of Computer Science. No. 14/2004). Hamilton, New Zealand: University of

Waikato.

10. Cendrowska, J. (1987) PRISM: An algorithm for inducing modular rules. International Journal of

Man-Machine Studies, Vol.27, No.4, 349-370.

11. Cohen, W.W., 1995. Fast Effective Rule Induction. In Proceedings of the Twelfth International

Conference on Machine Learning. Tahoe City, California, 1995. Morgan Kaufmann.

12. Fette I., Sadeh N., Tomasic A. (2007) Learning to detect phishing emails. Proceedings of the 16th

international conference on World Wide Web. 649-656.

13. Gaines, B.R., Compton, P. (1995) Induction of Ripple-Down Rules Applied to Modeling Large

Databases. J. Intell. Inf. Syst. 5(3):211-228.

14. Hall M., Frank E., Holmes G., Pfahringer B., Reutemann P., Witten I. (2009) The WEKA Data Mining

Software: An Update; SIGKDD Explorations, Volume 11, Issue 1.

15. Hall M. (1999) Correlation-based Feature Selection for Machine Learning. Thesis, Department of

Computer Science, University of Waikato, New Zealand.

16. Khadi A., Shinde S. (2014) Detection of phishing websites using data mining techniques.

International Journal of Engineering Research and Technology, Volume 2(12).

17. Liu, H. and Setiono, R. (1995) Chi2: Feature Selection and Discretization of Numeric Attribute.

Proceedings of the Seventh IEEE International Conference on Tools with Artificial Intelligence,

November 5-8, 1995, pp. 388.

18. Liu, B.; Hsu, W.; and Ma, Y. (1998) Integrating classification and association rule mining. In Proc.

1998 Int. Conf. Knowledge Discovery and Data Mining (KDD’98), 80–86.

19. Mohammad R., Thabtah F., and McCluskey L. (2013A) Intelligent Rule based Phishing Websites

Classification. IET Information Security, vol. 8, no. 3, pp. 153-160, July 2013-A.

20. Mohammad R., Thabtah F., and McCluskey L. (2015) Phishing Dataset. [Online].

http://eprints.hud.ac.uk/24330/

21. Mohammad R., Thabtah F., McCluskey L., (2013B) Neural Network based Algorithm for Web

Security. ICAI'13 - The 2013 International Conference on Artificial Intelligence. pp. (581-587). USA.

22. Mohammad R., Thabtah F., McCluskey L., (2012) An Assessment of Features Related to Phishing

Websites using an Automated Technique. Proceedings of the 7th IEEE International Conference

for Internet Technology and Secured Transactions (ICITST-2012). pp. (492-497), UK.

23. Qabajeh I., Thabtah F., Chiclana F. (2015) Dynamic Classification Rules Data Mining Method.

Journal of Management Analytics. Volume 2, Issue 3, pp. 233-253. Wiley.

24. Quinlan, J. (1993) C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.

25. Quinlan, J. (1979) Discovering rules from large collections of examples: a case study. In Expert

Systems in the Micro-electronic Age. Edinburgh, 1979.

26. Sumner M., Frank E., Hall M. (2005) Speeding up logistic model tree induction. Knowl. Discov.

Databases: Pkdd, 2005 (3721) (2005), pp. 675–683.

27. Thabtah F., Abdelhamid N. (2016) Deriving Correlated Sets of Website Features for Phishing

Detection: A Computational Intelligence Approach. Journal of Information & Knowledge

Management, 1650042.

28. Thabtah F., Qabajeh I., Chiclana F. (2016A) Constrained dynamic rule induction learning. Expert

Systems with Applications 63, 74-85.

29. Thabtah F., Mohammad RM., McCluskey L. (2016B) A dynamic self-structuring neural network

model to combat phishing. Neural Networks (IJCNN), 2016 International Joint Conference, 4221-

4226. Canada.

30. Thabtah F., Hammoud S., Abdeljaber H. (2015) Parallel and Distributed Single and Multi-label

Associative Classification Data Mining Frameworks based MapReduce. Parallel Processing

Letters 25, 1550002. World Scientific.

31. Thabtah F., Hammoud S. (2013) MR-ARM: A MapReduce Association Rule Mining. Journal of

Parallel Processing Letter, 23 (3) 1-22, 1350012. World Scientific.

32. Thabtah F., Gharaibeh O., Al-zubaidy R. (2012) Arabic Text Mining for Rule based Classification.

Journal of Information and Knowledge Management (JIKM). Volume: 11, Issue: 1(2012) pp. 1-10.

WorldScinet.

33. Thabtah F., Gharaibeh O., Abdel-jaber H. (2011) Comparison of rule based classification techniques

for the Arabic text. Proceedings of the IEEE ISIICT conference, pp. 75-83. Amman, Jordan.

34. Thabtah F. (2007) A review of associative classification mining. The Knowledge Engineering

Review 22 (01), 37-65.

35. Witten I. H., Frank E., Hall M. A., Pal C. J. (2016) Data mining: practical machine learning tools and

techniques with Java implementations, Morgan Kaufmann, 2016.
