A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness
Neslihan Dogan • Zuhal Tanrikulu
Published online: 12 August 2012
© Springer Science+Business Media, LLC 2012
Abstract Classification algorithms are the most com-
monly used data mining models that are widely used to
extract valuable knowledge from huge amounts of data.
The criteria used to evaluate the classifiers are mostly
accuracy, computational complexity, robustness, scalabili-
ty, integration, comprehensibility, stability, and interest-
ingness. This study compares classification algorithm
accuracies, speed (CPU time consumed) and robustness
across various datasets and implementation techniques.
The data miner selects the model mainly with
respect to classification accuracy; therefore, the perfor-
mance of each classifier plays a crucial role for selection.
Complexity is mostly dominated by the time required for
classification; here, the CPU time consumed by each
classifier is used as the measure of complexity. The study first
discusses the application of certain classification models on
multiple datasets in three stages: first, implementing the
algorithms on original datasets; second, implementing the
algorithms on the same datasets where continuous vari-
ables are discretised; and third, implementing the algo-
rithms on the same datasets where principal component
analysis is applied. The accuracies and the speed of the
results are then compared. The relationships of dataset
characteristics and implementation attributes with
accuracy and CPU time are also examined and discussed.
Moreover, a regression model is introduced to show the
combined effect of dataset and implementation conditions
on classifier accuracy and CPU time. Finally, the study
addresses the robustness of the classifiers, measured by
repetitive experiments on both noisy and cleaned datasets.
Keywords Classification · Data mining · Mining methods and algorithms
1 Introduction
Classification or prediction tasks are the most widely used
types of data mining. Classification algorithms are super-
vised methods that look for and discover the hidden asso-
ciations between the target class and the independent
variables [45]. Supervised learning algorithms allow tags to
be assigned to the observations, so that unobserved data
can be categorized based on the training data [25]. The
task, model structure, score function, search method, and
data management method are the main components of each
algorithm [26]. Image and pattern recognition, medical
diagnosis, loan approval, fault detection, and financial
trends are among the most well known examples of clas-
sification tasks [20].
Before utilizing a model produced by a classification
algorithm, that model is assessed with respect to some
specific criteria. The model will probably result in certain
errors; therefore, the data miner should take this possibility
into account when selecting a model [17]. Accuracy or the
percentage of instances that are correctly classified by the
model is the most commonly used decision criteria for
most model assessments [25].
However, other criteria can also be used to compare and
evaluate the models. Berson et al. [8] define these assess-
ment concepts as: accuracy, explanation, and integra-
tion abilities. Maimon and Rokach [45] introduce these
N. Dogan (&)
PricewaterhouseCoopers LLP, London, UK
e-mail: [email protected]
Z. Tanrikulu
Department of Management Information Systems,
Bogazici University, Istanbul, Turkey
e-mail: [email protected]
Inf Technol Manag (2013) 14:105–124
DOI 10.1007/s10799-012-0135-8
comparison criteria as: the generalization error of the model;
the computational complexity, or the amount of CPU time consumed
by the inducer; the comprehensibility, or the ability to understand
the model; the scalability, or the ability to run efficiently on larger
databases; the robustness, or the ability to handle missing or
noisy data; the stability, or the ability to produce repeatable
results on different datasets; and lastly, the interestingness,
or the ability of the classifier to generate valid and new
knowledge.
Before implementing classification algorithms, it is
recommended that incomplete, noisy, or inconsistent
datasets are preprocessed to make the knowledge discovery
process easier and of higher quality. The most well-known
steps in this process are summarization, cleaning,
integration and transformation, data and dimensionality
reduction, and discretisation [25]. Discretisation and
dimension reduction lie within the scope of this study.
Data discretisation techniques can be used to reduce the
number of values for a given continuous variable by splitting
the range of that variable into intervals. Binning, for exam-
ple, is a type of discretisation technique in which the variable
is split into a particular number of bins. Dimension reduction
is another preprocessing technique used to obtain a reduced
dataset that still represents the original dataset. The most
commonly used dimension reduction technique is principal
component analysis (PCA). "PCA searches for k n-dimensional
orthogonal vectors that can best be used to represent
the data, where k ≪ n. The original data are thus projected
onto a smaller space" [25].
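As an illustration only (not code from the study), the two preprocessing ideas above can be sketched in a few lines of NumPy; the helper names `discretise` and `pca` are our own:

```python
import numpy as np

def discretise(x, n_bins=3):
    """Equal-width binning: map a continuous variable to integer bin labels."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # np.digitize against the interior edges yields labels 0 .. n_bins-1
    return np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)

def pca(X, k):
    """Project X onto its k leading principal components (k << n)."""
    Xc = X - X.mean(axis=0)                          # centre the data
    cov = np.cov(Xc, rowvar=False)                   # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # k leading eigenvectors
    return Xc @ top                                  # data in the reduced space

x = np.array([0.0, 0.2, 0.5, 0.9, 1.0])
print(discretise(x))                              # [0 0 1 2 2]
rng = np.random.default_rng(0)
print(pca(rng.normal(size=(100, 5)), 2).shape)    # (100, 2)
```

Both steps trade some information for a simpler representation, which is exactly the effect the experiments below measure.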
As data volumes increase in real life today, it is
becoming harder to make valuable and significant decisions
from such data. In such situations, data
mining to extract the concealed knowledge from large
amounts of data is commonly used [25]. The predictive
power of data mining classification algorithms has been
appealing for many years. Numerous studies have thus
concentrated on proposing new classification models,
comparing existing models, or identifying important factors
that affect a model's performance.
Quinlan states that it is not an easy task to determine that
one algorithm is always superior to others, and he links the
capabilities of models to task dependency. His study compares
decision trees with network algorithms and concludes that
parallel-type problems are not well suited to decision trees,
nor sequential-type problems to back-propagation [53]. In
an additional study, some algorithms, such as LARCKDNF,
IEKDNF, LARC, BPRC and IE, were compared for three
tasks, and different results were found for each one [33].
Hacker and Ahn conducted another comparative experiment
that focused on eliciting user preferences. They compared
many methods and recommended a new classifier called
relative SVM, which outperformed others [24]. Further,
Putten et al. [51] compared the AIRS algorithm to other
algorithms, and found no significant evidence that it con-
sistently outperformed the others. Another study implemented
Naïve Bayesian, a decision tree, KNN, NN and M5 to
predict the lifetime of metallic components and
stated that methods dealing directly with continuous variables
perform better [23]. In a comparative paper, the authors
selected 16 model selection schemes and 58 benchmark
datasets. Their paper first indicated the rationale and com-
plexity of the schemes as a reference guide; secondly, it
provided a bias-variance analysis for each scheme classifi-
cation performance; and lastly, it determined the effective
schemes that can meet the practical requirements [65]. In
another study, the same authors point to useful data mining
implications, trying to understand whether meaningful relationships
can be found in the soil profile data of different
regions. They used the data collected from the WA Depart-
ment of Agriculture and Food (AGRIC) soils database and
compared those data mining methods to the existing statis-
tical methods [4]. The importance of feature selection is
emphasized in a study of He et al. [27] where the decision tree
and regression methods are applied to breastfeeding survey
data. In another comparative study, authors compared 22
decision tree, nine statistical, and two neural network algo-
rithms on 32 datasets with respect to the classification
accuracy, training time, and number of leaves (in decision
tree algorithms). They concluded that the algorithm called
POLYCLASS ranked top in terms of accuracy (mean
error rate and mean rank of error rate). The second best
algorithm was found to be logistic regression. QUEST was
found to be the most accurate decision tree algorithm. In
terms of training duration, the statistical algorithms needed
relatively longer training times [43]. A variety of techniques
can be used for the detection of network intrusions ranging
from traditional statistical methods to data mining approa-
ches. In another comparative study, the authors compared
rough sets, neural networks and inductive learning methods in
detecting network intrusion. The results indicate that data
mining methods and data proportion have a significant
impact on classification accuracy. Rough sets achieved better
accuracy, and balanced data proportions performed better than
unbalanced ones [66]. Moreover, the authors of the
‘Top 10 algorithms’ book presented C4.5, k-Means, SVM,
Apriori, EM, PageRank, AdaBoost, kNN, Naı̈ve Bayes, and
CART; provided a description of the algorithms; discussed
the impact of the algorithms, and reviewed current and fur-
ther research on the algorithms [64].
Jamain and Hand conducted an interesting study on col-
lecting and comparing the comparative studies of classifi-
cation methods. They claimed that most results are
inconclusive and limited. The authors reviewed the literature
and created a dataset of 5807 results. They also presented a
method to assess the overall methods [31]. In another study,
Kim et al. [37] proposed their own two methods: per-
document text normalization and feature weighting, so as to
eliminate the deficiencies of the Naı̈ve Bayesian algorithm in
text classification tasks. They claimed their methods perform
well with respect to many novel methods, such as SVM. Su
and Hsiao’s study concerned the Multiclass Mahalanobis-
Taguchi System (MMTS), developed to combine feature
selection and classification. This system uses a reduced
model measurement scale and examples to calculate a min-
imum weighted distance. The authors also compared their
system to other well-known algorithms and results in terms
of accuracy and feature selection. In addition, they imple-
mented the system in a real-life case and pointed to its
practicality [57].
In retrospect, some researchers have attempted to show
the importance of datasets in determining classifications. A
crucial point is also introduced about the danger of using a
single dataset for performance comparison, with tests
carried out for dynamic modifications of penalty and network
architectures [28]. A similar finding is also stated by
Keogh and Kasetty [35], since the performance results of
learning algorithms are expected to deviate across different
datasets; their study discusses data and implementation bias
for time series datasets. Brazdil et al. [11] presented a meta-
learning method to help the process of algorithm selection.
They used the k-Nearest Neighbor algorithm to identify the
datasets that are most similar to the one at hand. The per-
formance and speed of the candidate algorithms on those
datasets were used to populate a ranking to be provided to
the user. The distance between datasets is based on a small
set of data characteristics that represent a set of properties
that affect the performance of the learning algorithms.
Some of those characteristics were number of examples,
proportion of symbolic attributes and outliers and entropy
of classes. In another interesting study, the authors tried to
discover the similarities among classification algorithms,
and among datasets on the basis of error measures. The most
frequent finding was that the described areas were charac-
terized by rather concentrated distributions of descriptors
such as data availability, problem dimensionality (number
of examples, ratio of examples to attributes, etc.) or class
distribution or information content [34].
The importance of implementation settings while run-
ning algorithms has also been underlined by some authors
in the literature. For example, Keogh et al. [36] emphasize
the importance of implementation details, such as param-
eter selections in algorithms and claim that algorithms
should have few or no parameters. Pitt and Nayak point to
one factor that affects accuracy in their study: ‘‘The use of
feature reduction algorithms on a large population survey
database has shown that the use of the subset and attribute
evaluation methods mostly results in an improvement in
accuracy despite a reduction in the number of attributes’’
[50]. Finally, Howley et al. [30] studied the effects of data
preprocessing steps on classifier accuracies, comparing
the results of classifiers with no preprocessing applied
against those obtained after applying techniques such as
normalization or PCA.
The literature also shows many studies that reveal the
practical implementations of data mining algorithms.
Maindonald points to the difficulties and complexities of
comparing the algorithms and underlines the fact that users
who are more experienced working with a specific model
will have a tendency to produce the best results with that
model; therefore, the published performance results are
very broad indicators and dependent on datasets. Moreover,
he points to the insufficiency of datasets from several years
to determine changes in algorithm performances [46]. Ko
and Osei-Bryson tried to show the significance of Information
Technology (IT) investments for healthcare
industry productivity in research where they utilized
data mining implementations of Regression and MARS
(Multivariate Adaptive Regression Splines) algorithms. They
discovered that the relationship between IT spending and
productivity is considerably complex and dependent on
such conditions as the amounts invested in IT Stock, Non-
IT Labor, Non-IT Capital and Time [38]. The research of
Abbasi and Chen then addresses the categorization of
fake escrow websites, which are hard to distinguish
from legitimate ones. They assess the effec-
tiveness of various techniques, such as support vector
machines, neural networks, decision trees, Naïve Bayes,
and principal component analysis. A support vector
machine classifier, when combined with an extended fea-
ture set, differentiates fake pages from real ones with 90 %
accuracy for pages and 96 % accuracy for sites. The
authors also claim that an extended set of fraud features is
required because of the large range of fraud strategies [1].
Another interesting paper reveals the effectiveness of data
mining techniques on the prediction of fall-related injuries
for electronic medical records. Chiarini et al. [16] applied
both unsupervised (entropy) and supervised (informa-
tion gain) text mining techniques. Text mining alone is
deemed useful for identifying fall-related injuries. Then
they used the terms created by both schemes to build
clustering and logistic regression models and concluded that
information gain outperformed entropy for both clustering
and logistic regression. Also, clustering on entropy-based
terms did not produce viable results; however, even
entropy performed well for the classification task and did
categorize cases correctly when data was not available.
As seen in this literature review, the data mining com-
munity is very interested in comparing different classifi-
cation algorithms. For example, Dogan and Tanrikulu
proposed a comparative framework for evaluating classifier
accuracies. They claim that classifier accuracies are not
always the same for every dataset, and performance is,
therefore, significantly affected by such dataset character-
istics as variable types or the number of instances [19].
This study is also concerned with classification perfor-
mance and other factors that can affect accuracy by
applying new perspectives, as well as other quality indi-
cators such as classifier speed or robustness, to determine
classifier quality.
2 Research questions
This study aims to compare the classification algorithm
accuracies, speed and robustness with respect to various
datasets and implementation techniques. The research
questions for the study are as follows:
1. When implemented with different techniques, does the
performance of classifiers deviate significantly on
multiple datasets?
2. Is the performance of classifiers significantly affected
by dataset characteristics?
3. Does binning the continuous variables in the dataset
into discrete intervals affect classifier accuracy?
4. Does applying principal component analysis in the
dataset affect the classifier accuracy?
5. Based on the empirical results of this study (applying
classifiers on various datasets with different imple-
mentation techniques), what is the overall effect of
dataset and implementation attributes on the accuracy
of the classification algorithm?
6. Implemented using different techniques, does the
speed (consumed CPU time in seconds) of classifiers
deviate significantly on multiple datasets?
7. Based on the empirical results of this study (applying
classifiers on various datasets with different imple-
mentation techniques), what is the overall effect of
dataset and implementation attributes on speed for the
classification algorithm?
8. Does the speed of the algorithm significantly affect the
accuracy of the classification algorithm?
9. Do the abilities of classifiers to handle missing or noisy
data differ (their robustness)?
3 Method
Figure 1 shows the methodological framework maintained
during the research study. Throughout the study, accuracy
is referred to as the performance of the classifiers.
In the implementation phase, ten sample
datasets were used because the research study is interested
in applying the algorithms to multiple datasets. Fourteen
classification algorithms that are explained in Table 1 were
selected for implementation on the experimental datasets.
WEKA (Waikato Environment for Knowledge Analysis), a
preferred suite of machine learning software, was used here
as a tool to run AIRS2P, C4.5, CSCA, IBk, Logistics,
LogitBoost, MLP, MLVQ, Naı̈ve Bayesian and SVM
algorithms. SPSS (Statistical Package for the Social Sci-
ences) was also used as a tool to run the CART, Ex-CHAID
and QUEST algorithms, since these algorithms are avail-
able in SPSS. Finally, the Rosetta tool was used to run the
RSES algorithm. Data preprocessing steps were also
applied to the sample datasets, and the results of those
implementations were then tabulated. Afterwards, a
descriptive analysis and a One-way Anova test were car-
ried out to answer the first and sixth research questions.
Further, correlation analysis was conducted to answer the
second, third and fourth research questions. Lastly,
regression models were built to deal with the fifth and
seventh research questions. To learn whether there is a
relationship between accuracy and the speed of the clas-
sifier, correlation analysis was applied to answer question
eight. To answer the ninth research question, algorithms
were implemented on a noisy dataset before and after data
cleaning, and the results containing the variances were then
tabulated to determine the robustness of the classifiers.
3.1 Algorithms
Fourteen classification algorithms representing the differ-
ent types of classification models (decision trees, neural
networks, immune systems, probabilistic models, etc.)
were selected from the many existing classification algo-
rithms that fell within the scope of this study.
The selected algorithms were AIRS2P and CSCA arti-
ficial immune recognition systems algorithms; Ex-CHAID,
C4.5, CART and QUEST decision tree algorithms; the IBk
algorithm; the Logistics algorithm; the LogitBoost boosting
algorithm; MLP and MLVQ neural network algorithms; the
Naı̈ve Bayesian algorithm; the RSES rough sets algorithm
and the Support Vector Machine algorithm. Table 1 shows
the list of algorithms used in the study along with the tool
they were implemented in, the category they belonged to,
and finally, the people who introduced the classifier.
Similar to the natural human immune system, which
differentiates and recalls intruders, the AIRS algorithm is a
cluster-based approach that learns the structure of the
data and performs a k-nearest neighbor search.
AIRS2P are extensions of the existing AIRS algorithm with
some technical differences [51, 59–61]. Another artificial
immune system technique that is inspired by the func-
tioning of the clonal selection theory of acquired immunity
is the Clonalg algorithm. It is inspired by the ‘‘maintenance
of a specific memory set, selection and cloning of most
stimulated antibodies, death of non-stimulated antibodies,
affinity maturation (mutation), re-selection of clones pro-
portional to affinity with antigen, and the generation and
maintenance of diversity’’ [14]. A variant implementation
of Clonalg is called a Clonal Selection Classifier Algorithm
(CSCA), and it aims to maximize classification accuracy
and minimize misclassification rate [3, 13].
In decision tree algorithms, the classification procedure
is condensed into a tree. After the model is constructed, it is
applied to the entire database [20]. The J48 algorithm is a
version of the very popular C4.5 algorithm developed by
J. Ross Quinlan. C4.5 employs two pruning methods: the
first is known as sub-tree replacement, and the second as
sub-tree raising [52, 56]. CART (Classification and Regression
Trees) is another decision tree classifier that uses binary
splits; it first grows and then prunes the tree, uses the Gini
index as its splitting criterion, and handles missing values
with surrogate splits [12, 63].
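For concreteness, the Gini index that CART uses as its splitting criterion can be computed as follows (a small illustrative helper of our own, not taken from any of the tools used in the study):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(['a', 'a', 'a']))       # 0.0, a pure node
print(gini(['a', 'a', 'b', 'b']))  # 0.5, the two-class maximum
```

CART chooses the split that most reduces this impurity, weighted by the sizes of the resulting child nodes.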
The CHAID algorithm cultivates the tree by locating the
optimal splits until the stopping criteria is encountered with
respect to the Chi-squares [17]. The CHAID can deal with
missing values, and the outputs of the target function are
discrete [47]. The splitting and stopping steps contained in
the Exhaustive CHAID algorithm that were proposed in
1991 are the same as those in CHAID. The merging step
uses an exhaustive search process to merge any similar
pairs until a single pair remains. However, in large datasets
with many continuous predictor variables, this modified
CHAID algorithm may require significant computing time
[29]. The Ex-CHAID (Exhaustive CHAID) algorithm was
introduced by Biggs in 1991. Like CHAID, this method
chooses the best partition on the basis of statistical
[Fig. 1 Methodological framework: dataset collection and algorithm selection; implementations (1) basic, (2) after discretisation, (3) after PCA, (4) on a noisy dataset before and after data cleaning; descriptive analysis and One-way Anova difference tests on whether classifier accuracies or speeds differ; correlation analysis of dataset characteristics, discretisation, PCA and speed against accuracy; regression for overall models of accuracy and speed; descriptive analysis of robustness from accuracy before and after data cleaning]
significance and uses the Bonferroni inequality to calculate
that significance. Unlike CHAID, this algorithm does not
support simple partitions (low values of k), nor does it
discriminate against free-type (no restriction on the order of
values) predictor variables with many categories [9]. The
QUEST (Quick, Unbiased and Efficient Statistical Tree) is a
binary-split-decision tree algorithm used for classification
and data mining [55]. The objective of QUEST is similar to
that of the CART algorithm; it uses an unbiased variable
selection technique by default, uses imputation instead of
surrogate splits to deal with missing values, and can easily
handle categorical predictor variables with many categories
[44]. Badulescu [5] points to the difficulty of selecting the
best attribute when splitting the decision tree in the model
induction phase; he compared the performance of 29
different splitting measures, claiming that the FSS Naïve
Bayesian measure splits the attributes best.
The IBk (Instance-based k-nearest Neighbors) algorithm
is an instance-based variant of the algorithm used in
k-nearest neighbor classification. It is a framework
that creates classification predictions by using only specific
instances, and it does not keep a set of abstractions
derived from those instances. The plain nearest neighbor
algorithm, which IBk uses and extends, requires a larger
storage space; with an instance-based algorithm, this storage
need can be significantly reduced while learning and accuracy
rates do not decrease much [2].
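A minimal sketch of the nearest-neighbor voting that IBk builds on (our own illustration, assuming Euclidean distance and majority vote):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training instances."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every stored instance
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 0.9]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.05, 0.0])))  # 0
print(knn_predict(X, y, np.array([1.0, 0.95])))  # 1
```

IBk's contribution lies in which instances are stored, not in the voting rule itself, which is why its storage needs are lower than those of the plain nearest neighbor method.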
The Multinomial Logistic Regression Model utilizes
independent variables to predict the probability of events
by fitting the data to a logistic curve [26]. It was developed
using ridge estimators in logistic regression to improve the
parameter estimates and to diminish the error of future
predictions [42].
LogitBoost is a boosting algorithm that was formulated
by Jerome Friedman, Trevor Hastie, and Robert Tibshirani.
It casts the AdaBoost algorithm into a statistical
framework. Specifically, if AdaBoost is regarded as a
Table 1 The algorithms used for the experiments

Algorithm name | Acronym | Category | Year | Introduced by | Implementation tool
Artificial Immune Recognition Systems v2 Parallel | AIRS2P | Artificial immune systems classification | 2005 | Watkins [60] | WEKA v. 3.6.4
C4.5 Decision Tree Revision 8 | C4.5/J48 | Decision tree classification | 1993 | Quinlan [52] | WEKA v. 3.6.4
Classification and Regression Trees | CART | Decision tree classification | 1984 | Breiman et al. [12] | SPSS v. 17.0
Clonal Selection Classification Algorithm | CSCA | Artificial immune systems classification | 2005 | Brownlee [13] | WEKA v. 3.6.4
Exhaustive Chi-Square Automatic Interaction Detection | Ex-CHAID | Decision tree classification | 1991 | Biggs et al. [9] | SPSS v. 17.0
Instance-based k-nearest Neighbors | IBk | K nearest neighbors (k-NN) classification | 1991 | Aha et al. [2] | WEKA v. 3.6.4
Multinomial Logistic Regression Model | Logistic | Regression-based classification | 1992 | Le Cessie and van Houwelingen [42] | WEKA v. 3.6.4
Logistic Boost | LogitBoost | Additive logistic regression-based boosting | 2000 | Friedman et al. [22] | WEKA v. 3.6.4
Multi-Layered Perceptron | MLP | Neural network classification | 1986 | Rumelhart et al. [54] | WEKA v. 3.6.4
Multi-Pass Learning Vector Quantization | MLVQ | Neural network classification | 1990 | Kohonen [39–41] | WEKA v. 3.6.4
Naïve Bayesian Classification | Naïve Bayes | Bayesian classification | 1995 | John and Langley [32] | WEKA v. 3.6.4
Quick, Unbiased, Efficient, Statistical Tree | QUEST | Decision tree classification | 1997 | Loh and Shih [44] | SPSS v. 17.0
Rough Set Exploration System | RSES | Rough set approach (RSA) | 2000 | Prof. Andrzej Skowron's R&D team [7] | Rosetta v. 1.4.41
Support Vector Machines (Library for SVM) | SVM (LibSVM) | Support vector classification | 1992 | Boser et al. [10] | WEKA v. 3.6.4
generalized additive model and then the cost function of
logistic regression is applied, the LogitBoost algorithm can
be derived. LogitBoost can thus be seen as a convex
optimization problem that minimizes the logistic loss [22].
Multilayer Perceptron (MLP) is a type of artificial neural
network algorithm that considers the human brain as its
modeling tool [17, 54]. It provides a generic model for
learning real, discrete, and vector target values. The
underlying model is difficult to understand, and training
times may be extensive [47].
MLVQ (Multi-Pass LVQ) is the recommended practice
for the LVQ (Learning Vector Quantization) algorithm
"where a quick rough pass is made on the model using
OLVQ1 and then a long fine tuning pass is made on the
model with any of LVQ1, LVQ2.1 or LVQ3". The purpose
of the algorithm is to estimate the distribution of a class by
using a reduced number of codebook vectors where the
algorithm tries to minimize classification errors. The
algorithm is connected with the neural network class of
learning algorithms; however, it works significantly dif-
ferently when compared to conventional feed-forward
networks like Back Propagation [62].
The Naïve Bayesian algorithm frames the classification
problem in probabilistic terms and provides statistical
methods to categorize instances with respect to their
probabilities [17, 32].
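The underlying computation can be illustrated with a count-based sketch (the toy feature values and the add-one smoothing are our own illustrative choices, not details of the WEKA implementation):

```python
import math
from collections import Counter, defaultdict

def nb_fit(X, y):
    """Count-based estimates of P(class) and P(value | class)."""
    class_counts = Counter(y)
    cond = defaultdict(Counter)   # (class, feature index) -> value counts
    for row, label in zip(X, y):
        for j, v in enumerate(row):
            cond[(label, j)][v] += 1
    return class_counts, cond

def nb_predict(model, row):
    """Pick the class maximising log P(class) + sum of log P(value | class)."""
    class_counts, cond = model
    n = sum(class_counts.values())
    def score(c):
        s = math.log(class_counts[c] / n)
        for j, v in enumerate(row):
            # add-one (Laplace) smoothing so unseen values are not fatal
            s += math.log((cond[(c, j)][v] + 1) / (class_counts[c] + 2))
        return s
    return max(class_counts, key=score)

X = [('sunny', 'hot'), ('sunny', 'mild'), ('rain', 'mild'), ('rain', 'cool')]
y = ['no', 'no', 'yes', 'yes']
model = nb_fit(X, y)
print(nb_predict(model, ('rain', 'mild')))   # 'yes'
print(nb_predict(model, ('sunny', 'hot')))   # 'no'
```

The "naïve" part is the product over features, which assumes they are conditionally independent given the class.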
Another algorithm used in this study is the Rough Set
Exploration System (RSES). The RSES algorithm was
created by the research team supervised by Professor
Andrzej Skowron [6, 7]. A rough set was first introduced
by Zdzisław I. Pawlak and is a formal estimation of a
conventional set in terms of a pair of sets that give the
lower and the upper estimations of the original set. In
the standard version of rough set theory, the lower- and
upper-approximation sets are conventional (crisp) sets,
while in other variations the approximating sets may be
fuzzy sets [48, 49].
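The lower and upper approximations at the heart of rough set theory are easy to state in code (an illustrative sketch of our own; the blocks stand in for the indiscernibility classes that a subset of attributes induces):

```python
def rough_approximations(blocks, target):
    """Given indiscernibility classes `blocks`, approximate the set `target`."""
    lower = set()
    upper = set()
    for block in blocks:
        if block <= target:    # block lies entirely inside the target concept
            lower |= block
        if block & target:     # block overlaps the target concept
            upper |= block
    return lower, upper

# Objects 1..6, partitioned into classes of mutually indiscernible objects
blocks = [{1, 2}, {3, 4}, {5, 6}]
target = {1, 2, 3}             # the concept we want to approximate
lower, upper = rough_approximations(blocks, target)
print(lower)   # {1, 2}
print(upper)   # {1, 2, 3, 4}
```

Objects in the upper but not the lower approximation (here 3 and 4) form the boundary region that makes the set "rough".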
A support vector machine (SVM) algorithm, introduced
by Boser, Guyon and Vapnik in 1992, analyses data and
recognizes patterns used for classification and regression
analysis. It builds a hyperplane or a set of hyperplanes in a
high- or infinite-dimensional space, which can then be used
for classification, regression, or other tasks. An effective
division is achieved by the hyperplane with the largest
distance to the nearest training data points of any class
(the functional margin), since in general the larger the
margin, the lower the generalization error of the clas-
sifier [10, 18]. The WLSVM (Weka Library for SVM) can
be viewed as an implementation of the LibSVM running
under Weka environment [21]; the LibSVM package is an
efficient software tool for building SVM classifiers [15].
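The margin-maximizing idea can be sketched with a tiny linear SVM trained by stochastic sub-gradient descent on the hinge loss (a Pegasos-style toy of our own, not the solver that LibSVM actually implements):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Sub-gradient descent on the regularized hinge loss; y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)            # decaying step size
            if yi * (xi @ w) < 1:            # margin violated: step toward yi*xi
                w = (1 - eta * lam) * w + eta * yi * xi
            else:                            # margin satisfied: shrink only
                w = (1 - eta * lam) * w
    return w

X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
y = np.array([1, 1, -1, -1])
w = train_linear_svm(X, y)
print(np.sign(X @ w))   # agrees with y on this separable toy set
```

The regularization term pushes the weight vector toward a larger margin, which is the property the passage above credits for the SVM's low generalization error.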
Each algorithm can utilize both numerical and categor-
ical variables as inputs. In addition, the classifiers can
handle target classes with more than two class types.
Algorithms can also be referred to as classifiers or models.
3.2 Data collection
All sample datasets were collected from the UCI Machine
Learning Repository [58]. The experimental datasets were:
Acute, Breast Cancer, CPU, Credits, Iris, Letters, Pittsburg,
Red Wine, Segment, Wine All and White Wine. Table 2
summarizes the attributes of each of the datasets.
3.3 Implementation
Before the implementation of the algorithms, the datasets
were initially cleaned of missing, noisy and incorrect data.
First, missing data were removed from the datasets which
were then also cleaned of noisy data. Unnecessary space
characters or other spelling mistakes were also cleaned in
the datasets.
Ten datasets (Acute, Breast Cancer, CPU, Credits, Iris,
Letters, Red Wine, Segment, Wine All and White Wine)
Table 2 Dataset characteristics used for the experiments

Dataset name | Number of variables | Nominal variables | Numerical variables | Target class types | Number of instances
Acute | 7 | 6 | 1 | 2 | 120
Breast cancer | 9 | 0 | 9 | 2 | 684
CPU | 6 | 0 | 6 | 3 | 210
Credits | 15 | 9 | 6 | 2 | 653
Iris | 4 | 0 | 4 | 3 | 150
Letters | 16 | 0 | 16 | 26 | 20,000
Pittsburg | 12 | 10 | 2 | 6 | 92
Red wine | 11 | 0 | 11 | 6 | 1,599
Segment | 19 | 0 | 19 | 7 | 1,500
Wine all | 11 | 0 | 11 | 7 | 6,497
White wine | 11 | 0 | 11 | 7 | 4,898
were used to run the 14 classification algorithms (AIRS2P,
C4.5, CART, CSCA, Ex-CHAID, IBk, Logistics, Logit-
Boost, MLP, MLVQ, Naı̈ve Bayesian, QUEST, RSES, and
SVM). For all the algorithms, tenfold cross-validation was
implemented for the same datasets. This stage of the
experiment, referred to as "basic implementation", resulted
in 140 (10 datasets × 14 algorithms) rows of accuracy
and speed values, respectively.
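The protocol of this stage can be sketched as a loop over (dataset, algorithm) pairs, each recording one accuracy value and one CPU-time value. This is illustrative only: the study used the default settings of Weka, SPSS and Rosetta, whereas the classifiers and dataset below are scikit-learn stand-ins.

```python
import time

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each (dataset, algorithm) pair contributes one accuracy row and
# one CPU-time row, as in the "basic implementation" stage.
rows = []
for name, clf in [("NaiveBayes", GaussianNB()),
                  ("DecisionTree", DecisionTreeClassifier(random_state=0))]:
    start = time.process_time()                      # CPU time, not wall time
    acc = cross_val_score(clf, X, y, cv=10).mean()   # tenfold cross-validation
    cpu = time.process_time() - start
    rows.append((name, 100 * acc, cpu))

for name, acc, cpu in rows:
    print(f"{name}: accuracy {acc:.1f} %, CPU time {cpu:.2f} s")
```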
After the basic implementation phase, all continuous
variables in the datasets were binned into intervals
at ±1 standard deviation and saved as new variables.
Again, the fourteen algorithms were implemented on the pre-
processed datasets that comprised discrete variables. The
second stage of the experiment, referred to as "after dis-
cretisation", resulted in another 140 rows of accuracy and
speed values (10 datasets × 14 algorithms).
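One plausible reading of this ±1-standard-deviation binning is a three-way split at mean − 1 SD and mean + 1 SD; the exact binning used by the authors' tools is not specified, so the cut points below are an assumption.

```python
import numpy as np

def discretise_pm1sd(x):
    """Bin a continuous variable into three intervals split at
    mean - 1 SD and mean + 1 SD (one reading of the paper's
    plus/minus-one-standard-deviation binning)."""
    m, s = x.mean(), x.std()
    return np.digitize(x, [m - s, m + s])  # 0 = low, 1 = middle, 2 = high

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
print(discretise_pm1sd(x))  # [0 1 1 1 1 2 2]
```

Here the mean is 4 and the standard deviation 2, so values below 2 fall in the low bin and values of 6 or more in the high bin.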
Following the second stage, principal component anal-
ysis was conducted on the datasets. Components with
eigenvalues over 1.0 were retained and saved as
new variables. Again, the fourteen algorithms were imple-
mented on the preprocessed datasets made up of the
principal components. The third stage of the experiment,
referred to as "after PCA", resulted in another 140 rows of
accuracy and speed values (10 datasets × 14 algorithms).
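Retaining components with eigenvalues above 1.0 is the Kaiser criterion. A minimal sketch, assuming standardised inputs and synthetic data (the actual PCA was done in SPSS):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)  # two strongly correlated columns

# On standardised data, the PCA explained variances correspond to the
# eigenvalues of the correlation matrix; the Kaiser rule keeps the
# components whose eigenvalue exceeds 1.0.
Z = StandardScaler().fit_transform(X)
pca = PCA().fit(Z)
keep = int((pca.explained_variance_ > 1.0).sum())
components = PCA(n_components=keep).fit_transform(Z)
print(keep, components.shape)
```

The retained `components` then replace the original variables as classifier inputs, as in the third stage of the experiment.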
In total, a dataset called "results" was obtained from the
420 rows of accuracy and speed values derived from the
three stages of the experiment. This 420-row dataset was
used to answer the first eight research questions.
In order to answer the final research question, the
‘Breast Cancer’, ‘Credits’ and ‘Pittsburg’ datasets were
selected, and algorithms were run on them before and after
cleaning for noisy instances. These results were then
compared.
3.3.1 Software tools and hardware configuration
WEKA version 3.6.4, SPSS version 17.0, and Rosetta
version 1.4.41 were the main components used to run the
selected algorithms. In addition, SPSS was used to con-
duct all the statistical tests, such as correlations, regres-
sion, Anova, or descriptive tests. For this study, the
default parameter settings of WEKA, Rosetta, and SPSS
were implemented for the algorithms. Environmental facts
are also important for the nature of these experiments. All
data mining algorithms and statistical tests were con-
ducted on a personal computer with the following con-
figuration: Microsoft XP Professional Operating System
with Service Pack 3, 2 × 2.10 GHz CPU, 3 GB RAM,
and a 150 GB hard disk. Table 1 shows the list of algo-
rithms used in the study along with the tool in which they
were implemented, the category they belong to, the year
each was introduced, and finally the people who introduced the
algorithms.
4 Results
In this section, the performance and speed results of each
algorithm for each case are discussed and research ques-
tions answered accordingly.
The percentage of instances that are correctly classified
gives the accuracy [20], which is referred to as the
performance of the classifiers throughout the study. The costs
of a wrong assignment, in other words, misclassification
costs, are not within the scope of this study. Han and
Kamber [25] describe speed as 'the computational
costs involved in generating and using the given classifier
or predictor'. In this study, speed refers to the CPU
time consumed by the classifiers during model building, in
other words, the time observed in seconds to generate the
classifier. The more time spent during modeling, the
slower the classifier.
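These two measures can be computed as follows; the sort call is only a hypothetical stand-in for model training, included so the CPU-time measurement has something to time.

```python
import time

def accuracy(y_true, y_pred):
    """Percentage of correctly classified instances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)

print(accuracy(["a", "b", "a", "b"], ["a", "b", "b", "b"]))  # 75.0

# Speed = CPU seconds consumed while building the model
# (sorting here is just a stand-in for model training).
start = time.process_time()
_model = sorted(range(100_000), reverse=True)
cpu_seconds = time.process_time() - start
```

Note that `time.process_time` counts CPU time only, excluding sleeps and other processes, which matches the study's notion of speed better than wall-clock time.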
4.1 Research Question 1: When implemented
with different techniques, does the performance
of classifiers deviate significantly on multiple
datasets?
Based on the findings of the empirical study, the same
classifier is not the best choice for all datasets, nor does
any single classifier consistently outperform the others. For each
dataset, the best predictive classifier is identified for each
stage of the experiment. As Dogan and Tanrikulu [19] also
claim in their study, no classifier can be seen to outper-
form all the other classifiers on every dataset.
According to Table 3, the overall best accuracy is
obtained as 100 percent in the ‘Acute’ dataset. The clas-
sifiers producing that rate of accuracy are: AIRS2P, C4.5,
CSCA, IBk, LogitBoost, Logistics, MLP, MLVQ, Naı̈ve
Bayesian, RSES, and SVM. The ‘Acute’ dataset seems to
be the easiest one to predict; the only algorithms
that could not achieve that level of accuracy were
CART, Ex-CHAID, and QUEST. Table 4
shows the ten worst prediction results; the algorithms
with the lowest prediction abilities were MLVQ, QUEST, Ex-
CHAID, and CART. The ‘Letters’ dataset, with the highest
number of instances and variables seems to have been the
hardest dataset to predict, and mostly, the decision tree
algorithms had the lowest prediction values.
Table 5 displays the detailed accuracy results for the
best result cases of each dataset in the basic implementa-
tions step. MLP produced the best accuracy for the ‘Acute’,
‘CPU’, ‘Iris’ and ‘Segment’ datasets. IBk had the best
performance for the ‘Acute’, ‘Red Wine’ and ‘White Wine’
datasets. Logistics had the best performance for the ‘Acute’
and ‘Wine All’ datasets. AIRS2P, C4.5, CSCA, Logit-
Boost, and SVM had the best accuracy for only the
‘Acute’ dataset. RSES produced the best result for only the
‘Letters’ dataset, and CART had the best accuracy for the
‘Credits’ dataset. Interestingly, Ex-CHAID, Naïve Bayes-
ian, and QUEST never produced the best result for any
dataset. This may mean that they cannot handle continuous
variables and dense dimensionality as well as MLP, IBk,
or Logistics can. Since ‘Letters’ is the most complicated
dataset, RSES demonstrated very powerful prediction on
this complex dataset. The accuracy of MLVQ on the
‘Letters’ dataset is very low, even lower than random
guessing (1/26 = 3.85 %). This must be due to the fact that
not all class types are represented in the data file: 10 of the
26 letter types are never represented, which causes a data
bias and apparently reduces the learning power of the
MLVQ algorithm. The accuracy of MLVQ depends on
the class distribution in the training dataset, since a good
distribution of samples will ensure useful models are built.
Table 6 displays the detailed accuracy results for the
best result cases for each dataset after the discretisation
step. IBk had the best accuracy for the ‘Acute’, ‘Wine All’,
‘Wine Red’ and ‘Wine White’ datasets. RSES had the best
accuracy for the ‘Acute’, ‘Letters’, and ‘Segment’ datasets.
Logistics had the best performance for the ‘Acute’ and
‘Iris’ datasets. MLP had the best value for the ‘Acute’ and
‘CPU’ datasets. SVM predicted the ‘Acute’ and ‘Breast
Cancer’ datasets best. CART and Ex-CHAID algorithms had
the best performance for the ‘Credits’ dataset, and finally
C4.5, CSCA, AIRS2P, LogitBoost, and MLVQ had the
best accuracy for only the ‘Acute’ dataset. When the
continuous variables were binned into intervals, IBk and
RSES started to predict better more often, which may
reflect their ability to handle discrete values.
However, neither Naïve Bayesian nor QUEST could
predict as well as the other algorithms.
Table 7 displays the detailed accuracy results for the
best result cases for each dataset after the PCA step. IBk
had the best performance for the ‘Acute’, ‘Segment’,
‘Wine All’ and ‘Wine Red’ datasets. MLP had the best
Table 3 Overall best accuracy results

Dataset  Algorithm                                                                          Integers binned  PCA  Accuracy
Acute    AIRS2P, C4.5, CSCA, IBk, LogitBoost, Logistics, MLP, MLVQ, RSES, SVM               No               N    100
Acute    AIRS2P, C4.5, CSCA, IBk, LogitBoost, Logistics, MLP, MLVQ, RSES, SVM               Yes              N    100
Acute    AIRS2P, C4.5, CSCA, IBk, LogitBoost, Logistics, MLP, MLVQ, Naïve Bayes, RSES, SVM  Yes              Y    100

Table 4 Overall worst accuracy results

Dataset  Algorithm        Integers binned  PCA  Accuracy
Letters  CART             Yes              N    35.3
Iris     EXCHAID, QUEST   Yes              Y    33.3
Letters  QUEST            Yes              N    31.1
Letters  CART             Yes              Y    31.0
Letters  QUEST            Yes              Y    27.6
Letters  QUEST            No               N    23.7
Letters  MLVQ             Yes              Y    3.85
Letters  MLVQ             Yes              N    3.80
Letters  MLVQ             No               N    3.78
Table 5 Best accuracy results for each dataset/basic implementations

Dataset        Algorithm                                   Accuracy
Acute          AIRS2P, C4.5, CSCA, IBk, LogitBoost,        100
               Logistics, MLP, MLVQ, SVM
Breast cancer  MLVQ                                        97.2
CPU            MLP                                         97.2
Credits        CART                                        87.4
Iris           MLP                                         97.3
Letters        RSES                                        100
Segment        MLP                                         96.7
Wine all       Logistics                                   77.4
Wine red       IBk                                         64.7
Wine white     IBk                                         65.3

Table 6 Best accuracy results for each dataset/after discretisations

Dataset        Algorithm                                   Accuracy
Acute          AIRS2P, C4.5, CSCA, IBk, LogitBoost,        100.0
               Logistics, MLP, MLVQ, RSES, SVM
Breast cancer  SVM                                         97.1
CPU            MLP                                         97.2
Credits        CART, Ex-CHAID                              86.4
Iris           Logistics                                   90.7
Letters        RSES                                        98.5
Segment        RSES                                        89.9
Wine all       IBk                                         61.6
Wine red       IBk                                         63.0
Wine white     IBk                                         61.7
accuracy for the ‘Acute’, ‘CPU’ and ‘Iris’ datasets. C4.5
had the best accuracy for the ‘Acute’ and ‘Wine White’
datasets. Logistics had the best value for the ‘Acute’ and
‘Credits’ datasets. LogitBoost had the best accuracy for the
‘Acute’ and ‘Breast Cancer’ datasets. RSES predicted the
‘Acute’ and ‘Letters’ datasets best. AIRS2P, CSCA,
MLVQ, Naı̈ve Bayesian, and SVM could only best predict
the ‘Acute’ dataset. After the PCA application, Naïve
Bayesian and C4.5 started to predict better, suggesting they
may be better at handling smaller amounts of data and
fewer dimensions. CART, Ex-CHAID, and QUEST did not pre-
dict best in any dataset after component analysis.
The performance variable was binned into intervals as
LOW, MIDDLE, GOOD, and VERY GOOD with respect
to ±1 standard deviation. Table 8 shows the distribution of
each classifier across those performance intervals with
respect to all stages of the experiment. Table 8 shows that
the distribution of algorithms such as C4.5 and IBk was
mostly in the ‘Very Good’ interval; AIRS2P, Logistics and
MLP were the other ‘Good’ ones. IBk was the only classifier
with no result in the ‘Low’ interval.
The basic concern of the first research question was to
determine whether the classifiers have significantly differ-
ent accuracies on multiple datasets. Even though the results
show that none of the classifiers is dominant and different
classifiers will predict better in different circumstances, a
One-way Anova test can help to better visualize the dif-
ferences between classifier accuracies. Table 9 shows that
the means of the prediction ability of the classifiers on the
original datasets do not differ significantly from each
other, and Fig. 2 illustrates this finding.
According to Fig. 2, the best classification mean belongs to
IBk; the other well-performing classifiers are
Logistics, C4.5 and MLP. The worst mean value belongs to
MLVQ; the other poorly performing classifiers are
QUEST and RSES.
Table 9 shows that the means of the prediction abilities
of the classifiers after discretisation still do not differ
significantly from each other, and Fig. 3 illustrates
this finding. According to Fig. 3, the top classi-
fiers are RSES, IBk, C4.5 and SVM; the worst classifiers
are QUEST, Ex-CHAID and CART. In the second stage of
the experiment, a general tendency of performance to drop
is observable; however, CSCA, MLVQ, RSES and SVM
show a tendency to increase. RSES and SVM demonstrated
a sharp increase in performance.
Table 7 Best accuracy results for each dataset/after PCA
Dataset Algorithm Accuracy
Acute AIRS2P, C4.5, CSCA, IBk,
LogitBoost, Logistics, MLP, MLVQ,
Naı̈ve Bayes, RSES, SVM
100
Breast cancer LogitBoost 97.5
CPU MLP 91.4
Credits Logistics 82.5
Iris MLP 86.7
Letters RSES 98.5
Segment IBk 85.9
Wine all IBk 58.0
Wine red IBk 60.0
Wine white C4.5 51.9
Table 8 Overall distributions of classifiers across performance
intervals
Algorithm Accuracy Total
Low Middle Good Very good
AIRS2P 8 2 12 8 30
C4.5 2 8 10 10 30
CART 7 10 12 1 30
CSCA 8 5 10 7 30
EXCHAID 7 10 12 1 30
IBk 0 10 10 10 30
Logistics 6 5 9 10 30
LogitBoost 7 5 10 8 30
MLP 4 7 9 10 30
MLVQ 10 5 8 7 30
Naı̈ve Bayes 6 7 10 7 30
QUEST 12 6 11 1 30
RSES 8 6 7 9 30
SVM 2 11 10 7 30
Total 87 97 140 96 420
Table 9 Anova accuracy results

                        Sum of squares  df     Mean square  F    Sig.
All trials
  Between groups           11,404.3      13.0     877.3     2.2  .009
  Within groups           162,245.5     406.0     399.6
Basic implementations
  Between groups            6,426.6      13.0     494.4     1.2  .307
  Within groups            53,099.2     126.0     421.4
After discretisation
  Between groups            3,870.1      13.0     297.7      .8  .648
  Within groups            46,269.2     126.0     367.2
After PCA
  Between groups            4,081.5      13.0     314.0      .7  .768
  Within groups            57,154.7     126.0     453.6
Table 9 shows that the means of the prediction abilities
of the classifiers after PCA do not differ significantly from
each other; the means of the classifiers come even closer
together, and Fig. 4 illustrates this finding.
According to Fig. 4, the top classifiers are IBk, C4.5, and
SVM; the worst classifiers are QUEST, Ex-Chaid, and
CART. In the third stage of the experiment, a general
tendency of the performance to drop is still observable.
Although the claim was that a classifier cannot outperform
the others in every dataset, with respect to all these
experimental trials, IBk showed the best performance on
average for all the datasets.
As can be estimated, when all 420 trials are taken into
account, the means for the algorithms will differ signifi-
cantly. Figure 5 supports this finding, revealing the best
classifiers out of all the trials to have been IBk, C4.5 and
MLP. Logistics and LogitBoost also predicted well.
However, QUEST, MLVQ, Ex-CHAID, and CART had the
lowest predictive power.
4.2 Research Question 2: Is the performance
of classifiers significantly affected by dataset
characteristics?
Once all of the iterations were completed in the imple-
mentation step, a dataset of 420 rows, including combi-
nations of the datasets, algorithms, discretisation and PCA
application options, with 13 columns for the variables, was
obtained.
The first 11 fields, except for the ‘Trial ID’ in Table 10,
were set as input variables, which were ‘dataset name’,
‘algorithm name’, ‘integer variables binned’, ‘number of
Fig. 2 Anova mean plots/basic stage/accuracy
Fig. 3 Anova mean plots/discretisation stage/accuracy
Fig. 4 Anova mean plots/PCA stage/accuracy
Fig. 5 Anova mean plots/all trials/accuracy
principal components’, ‘PCA applied’, ‘percentage of
cumulative var. obtained in PCA’, ‘number of variables’,
‘number of nominal variables’, ‘number of numerical
variables’, ‘number of target class types’, and ‘number of
instances’. The last fields in Table 10 show the perfor-
mance and CPU time variables, which were set as the
dependent variables. Since the second research question is
interested in dataset characteristics, the independent vari-
ables were defined based on dataset attributes, such as the
number of variables, number of nominal variables, number
of numerical variables, number of target class types, and
number of instances.
On the newly created dataset, referred to as the Results
dataset, correlation analyses can be conducted to
determine whether any of the input variables signifi-
cantly affected the performance results. First, in order to
conduct the correlation analysis, all variables were coded
as numerical variables, and Z-score normalization was
applied to them. SPSS was used for the
implementation.
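The two preprocessing and analysis steps just described can be sketched in a few lines; this reimplements the textbook formulas rather than reproducing the SPSS run, and the arrays are toy data, not values from the Results dataset.

```python
import numpy as np

def zscore(x):
    """Z-score normalisation: zero mean, unit standard deviation."""
    return (x - x.mean()) / x.std()

def pearson(x, y):
    """Pearson r computed as the mean product of paired z-scores."""
    return float((zscore(x) * zscore(y)).mean())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(round(pearson(x, 2 * x), 6))   # 1.0  (perfect positive correlation)
print(round(pearson(x, -x), 6))      # -1.0 (perfect negative correlation)
```

Because z-scoring removes each variable's scale, the resulting coefficient depends only on the linear co-movement of the two variables, which is why the correlations in Table 11 are comparable across attributes measured in different units.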
According to Table 11, all the independent variables
were found to be significantly correlated to the dependent
variable, the performance of the classifier. Based on
these results, the number of variables (-.290
Pearson value), the number of nominal variables (.280),
the number of numerical variables (-.417), the number
of target class types (-.334) and the number
of instances (-.456) in the dataset were found
to correlate with classifier
performance. As a result, the answer to the second question
indicates that all dataset characteristics can affect classifier
performance. In other words, the number of instances,
nominal or numerical variables, target classes in a dataset
makes a significant difference in terms of the classifier
success rate.
Apparently, proper preparation of the data is a key factor
in any classification problem. The data must be properly
cleansed to eliminate inconsistencies and support the
requirements of the mining application. Additionally, most
algorithms require some form of data transformation, such
as generalization or normalization, in order to reduce the
classification difficulty. Obviously, a dataset with a high
number of instances, classes and variables (attributes)
becomes more complex for the classifier to predict.
In most classification and regression problems, if the
training and test data are prepared by the tool, normalization
methods are applied to the attributes in order to reduce the
cardinality and increase the discriminating power of the
algorithm.
4.3 Research Question 3: Does binning the continuous
variables in the dataset into discrete intervals affect
classifier accuracy?
According to Table 12, the input variable ‘continuous
variables binned into intervals’ was not found to
significantly correlate to the dependent variable, the
performance of the classifier (-.058 Pearson value). As a
result, the answer to the third question indicates that
discretisation of the continuous variables in a dataset will
not affect classifier performance.
Further, Table 13 shows that the performance means of
instances where continuous variables were discretised
and those where they were not
are not significantly different with
Table 10 Excerpts from the results dataset

Trial ID                                   1       …  88     …  420
Dataset name                               Acute      Iris      White wine
Algorithm name                             AIRS2P     CSCA      SVM
Implementation attributes
  Integer variables binned                 No         Yes       Yes
  No of principal components               0          1         5
  PCA applied                              N          Y         Y
  Percentage of cumulative var.
  obtained in PCA                          0          70.99     72.186
Dataset attributes
  No of variables                          7          4         11
  No of nominal variables                  6          0         0
  No of numerical variables                1          4         11
  No of target class types                 2          3         7
  No of instances                          120        150       4898
Class
  Accuracy (%)                             100        84        49.392
  CPU time (s)                             0.13       0.05      10.02
Table 11 Correlations between dataset characteristics and accuracy

Dataset characteristics        Pearson correlation  Sig. (2-tailed)
Number of variables                 -.290**              .000
Number of nominal variables          .280**              .000
Number of numerical variables       -.417**              .000
Number of target class types        -.334**              .000
Number of instances                 -.456**              .000
N = 420
** Significant correlation
respect to the significance level of the One-way Anova test
(.235).
4.4 Research Question 4: Does applying principal
component analysis (PCA) in the dataset affect
the classifier accuracy?
According to Table 14, the input variables ‘number of
principal components’ (-.289 Pearson value), ‘percentage
of cumulative variance obtained in PCA’ (-.125 Pearson
value) and ‘PCA applied’ (-.126 Pearson value) were
found to be significantly correlated to the dependent vari-
able, the performance of the classifier. As a result, the
answer to the fourth question is that applying PCA in the
dataset can affect classifier performance. Because principal
components act like variables in the dataset, they have a
similar effect on accuracy: as the number of com-
ponents increases, accuracy can get worse. Additionally,
applying PCA to a dataset reduces its information content,
so classifier prediction accuracy can be lower.
On the other hand, Table 15 shows that the performance
means of instances to which PCA was applied and
those to which it was not were found to be sig-
nificantly different with respect to the significance level of
the One-way Anova test (.010).
4.5 Research Question 5: Based on the empirical
results of this study (applying classifiers on various
datasets with different implementation techniques),
what is the overall effect of dataset
and implementation attributes on the accuracy
of the classification algorithm?
Since there is a Results dataset that contains the algorithm,
dataset, and implementation specific attributes in Table 10,
it is possible to use these variables in a regression model
and determine their causal effects on the dependent per-
formance variable. After finding correlations between
selected independent variables and the dependent perfor-
mance variable in the previous research questions, a
regression model was developed to answer the fifth
research question. According to the regression results, it
was possible to build a model demonstrating the correla-
tions with the performance result. Equation (1) shows the
regression function for modeling the performance of a
classifier.
Performance = −.552 × number of principal components
            + .189 × % of cumulative variance obtained in PCA
            − .099 × number of variables
            + .223 × number of nominal variables
            + .117 × number of target class types
            − .367 × number of instances
            + .006 × discretisation
            + .155 × PCA application                              (1)
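Since all variables were z-score normalised, Eq. (1) can be applied directly to standardised inputs. A minimal sketch, taking the coefficients from the paper but using a toy input vector of our own choosing:

```python
# Standardised regression coefficients from Eq. (1)
coef = {
    "number_of_principal_components": -0.552,
    "pct_cumulative_variance_in_pca":  0.189,
    "number_of_variables":            -0.099,
    "number_of_nominal_variables":     0.223,
    "number_of_target_class_types":    0.117,
    "number_of_instances":            -0.367,
    "discretisation":                  0.006,
    "pca_application":                 0.155,
}

def predicted_performance(z):
    """Predicted standardised performance for z-scored inputs z."""
    return sum(coef[name] * value for name, value in z.items())

# Toy input: every predictor one standard deviation above its mean.
z = {name: 1.0 for name in coef}
print(round(predicted_performance(z), 3))  # -0.328
```

The negative net value for this toy input reflects the fact that the negative coefficients (components, variables, instances) outweigh the positive ones when all predictors rise together.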
The purpose of conducting a regression is to understand
whether the coefficients on the independent variables
actually differ from zero; in other words, whether the
independent variables have an observable effect on the
dependent variable. If coefficients are different from zero,
Table 12 Correlations between accuracy and discretisation

                               Pearson correlation  Sig. (2-tailed)
Continuous variables binned         -.058                .235
N = 420

Table 13 Anova results of accuracy and discretisation

Accuracy         Sum of squares  df   Mean square  F      Sig.
Between groups        584.950      1     584.950   1.413  .235
Within groups     173,064.824    418     414.031
Total             173,649.774    419
Table 14 Correlations between PCA and accuracy

                                                   Pearson correlation  Sig. (2-tailed)
Number of principal components                          -.289**              .000
Percentage of cumulative variance obtained in PCA       -.125**              .010
PCA applied                                             -.126**              .010
N = 420
** Significant correlation

Table 15 Anova results of accuracy and PCA

Accuracy         Sum of squares  df   Mean square  F      Sig.
Between groups      2,743.139      1   2,743.139   6.709  .010
Within groups     170,906.635    418     408.868
Total             173,649.774    419
this means that the null hypothesis (that the dependent
variable is not affected by the independents) may be rejected [29]. Based
on the regression function in Eq. (1), some of the
independent variables were found to affect the dependent
performance variable. The number of
principal components, the number of instances, and the number
of variables all had a negative effect on performance. On
the other hand, the number of nominal variables, the
percentage of cumulative variance obtained in PCA, the
PCA application and the number of target class types all
had a positive effect on performance.
However, within a 95 % confidence interval, the p val-
ues in Table 16 must be close to or lower than 0.05 in order
to be accepted as significant enough. With respect to
p values (sig. column), the effect of the number of principal
components, the number of nominal variables and the
number of instances on performance is said to be more
certain. These findings are also in line with the research
question 2 findings where we claimed that the high number
of variables and instances increase the classification diffi-
culty and impact the algorithm discriminating power.
4.6 Research Question 6: Implemented using different
techniques, does the speed of classifiers deviate
significantly on multiple datasets?
The basic concern of the sixth research question is to
determine whether the classifiers have significantly differ-
ent CPU time on multiple datasets. A One-way Anova test
can help to visualize the differences between
classifier speeds. Table 17 shows that the means of the
CPU time of the classifiers on the original datasets do not
differ significantly from each other (sig. 0.447), and Fig. 6
illustrates this finding. According to Fig. 6,
the slowest classifier is CSCA; MLP is slightly slower than
the remaining classifiers, which are all fast.
Table 17 shows that the differences in the means of the
CPU time of the classifiers after discretisation are not
significant, and Fig. 7 illustrates this finding.
According to Fig. 7, an overall increase in the time spent is
observable for all classifiers. The slowest classifier is still
CSCA; the rest of the classifiers display similar low CPU
Table 16 Regression results/accuracy

Model                                              Unstd. coefficient   Std. coefficient   t       Sig.
                                                   B       SE           Beta
Constant                                            .000    .040                            .000   1.000
Number of principal components                     -.552    .090        -.552             -6.152    .000
Percentage of cumulative variance obtained in PCA   .189    .315         .189               .598    .550
Number of variables                                -.099    .055        -.099             -1.809    .071
Number of nominal variables                         .223    .054         .223              4.095    .000
Number of target class types                        .117    .105         .117              1.113    .266
Number of instances                                -.367    .097        -.367             -3.784    .000
Discretisation                                      .006    .046         .006               .138    .890
PCA application                                     .155    .317         .155               .488    .626
Table 17 Anova CPU time results

                        Sum of squares  df   Mean square  F      Sig.
All trials
  Between groups          3.450E+08      13   2.654E+07   3.075  .000
  Within groups           3.504E+09     406   8.631E+06
Basic implementations
  Between groups          1.936E+07      13   1.489E+06   1.009  .447
  Within groups           1.860E+08     126   1.476E+06
After discretisation
  Between groups          2.328E+08      13   1.791E+07   1.116  .352
  Within groups           2.007E+09     125   1.605E+07
After PCA
  Between groups          1.576E+08      13   1.212E+07   1.232  .265
  Within groups           1.240E+09     126   9.843E+06
times. In the second stage of the experiment, a general
tendency of the CPU time to increase is observable.
Table 17 shows that the means of the CPU time of the
classifiers after PCA also do not differ significantly from
each other, and Fig. 8 illustrates this finding.
According to Fig. 8, the slowest classifier is still CSCA,
and the other classifiers have lower CPU times. In the third
stage of the experiment, a general tendency of the CPU time to
increase is observable.
As can be estimated, when all 420 trials are taken into
account, the means of the algorithms will differ signifi-
cantly. Figure 9 shows this finding well and reveals that the
slowest classifier is always CSCA. Different implementa-
tion techniques significantly change the overall picture for
classifier CPU time.
4.7 Research Question 7: Based on the empirical
results of this study (applying classifiers on various
datasets with different implementation techniques),
what is the overall effect of dataset
and implementation attributes on speed (consumed
CPU time in seconds) for the classification
algorithm?
To answer the seventh research question, a regression
model was again developed. According to the regression
results in Table 18, it is possible to build a model showing
the factors that affect speed.
Within a 95 % confidence interval, p values in Table 18
should be close to or lower than 0.05 in order to be
accepted as significant enough. With respect to p values
(sig. column), only the effect of the number of instances on
actual CPU time is more certain. In other words, as the
number of instances increase, the time to build the model
Fig. 6 Anova mean plots/basic implementations/CPU time
Fig. 7 Anova mean plots/after discretisations/CPU time
Fig. 8 Anova mean plots/after PCA/CPU time
Fig. 9 Anova mean plots/all trials/CPU time
will take longer. This is an expected outcome of the
experiments, because it is known that most algorithms
require more time and resources as the number of samples
increases. For example, in one study of decision
trees, moving from 10 K to 100 K cases increased CPU time
on a PC from 1.4 to 61 s, a factor of 44; the time required
for rule sets, however, increased from 32 to 9,715 s, a
factor of 300. As another example, the KNN algorithm needs
N × k comparisons. The number of iterations required for
convergence varies and may depend on N, but initially this
algorithm can be considered linear in the dataset size [64].
A negative coefficient here implies a reduction in CPU
time, meaning it takes less time to build the model, while
a positive coefficient implies an increase in the time needed
to build the model.
4.8 Research Question 8: Does the speed
of the algorithm (CPU time consumed in seconds)
significantly affect the accuracy
of the classification algorithm?
Research Question 8 seeks an answer to the effects of the
speed of a classifier on its accuracy. In
other words, the question asks whether the
accuracy rate of an algorithm changes depending on whether it
classifies faster or slower. To answer this question, a
correlation analysis was conducted in SPSS between
the ‘Accuracy’ and ‘CPU time’
variables. Based on the results shown in Table 19 (Pear-
son value -0.041), we cannot say that there is a signifi-
cant correlation between the performance and the speed of a
classifier; therefore, we cannot assume that
the accuracy of an algorithm is reduced if it takes
more, or even less, time to complete the prediction
process.
4.9 Research Question 9: Do the abilities of classifiers
to handle missing or noisy data differ?
To answer this last research question, ‘Breast Cancer’ (0
nominal variables, 9 numerical variables, 2 target classes,
684 instances), ‘Credits’ (9 nominal variables, 6 numerical
variables, 2 target classes, 653 instances) and ‘Pittsburg
Bridges’ (10 nominal variables, 2 numerical variables, 6
target classes, 92 instances) datasets were selected for the
experiments. These datasets have some noise, i.e., missing
values. First, all 14 algorithms with tenfold cross-valida-
tion were applied to the datasets, and accuracy results were
tabulated. Then, the missing values were cleaned from the
datasets, and the same algorithms were implemented once
again. Comparing the results of the two steps indicates
which algorithms are more robust.
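The before/after comparison can be sketched as follows. This is a toy reconstruction: the data are synthetic, and the "before" run fills missing cells with column means as a stand-in for tools that tolerate missing values, which is an assumption rather than the authors' procedure.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_noisy = X.copy()
X_noisy[rng.random(X.shape) < 0.1] = np.nan      # inject 10 % missing cells

def robustness_deviation(clf, X_noisy, y):
    """|accuracy before cleaning - accuracy after cleaning|.
    'Before' fills missing cells with the column mean; 'after'
    keeps only the complete rows (missing values cleaned)."""
    filled = np.where(np.isnan(X_noisy), np.nanmean(X_noisy, axis=0), X_noisy)
    before = cross_val_score(clf, filled, y, cv=10).mean()
    complete = ~np.isnan(X_noisy).any(axis=1)
    after = cross_val_score(clf, X_noisy[complete], y[complete], cv=10).mean()
    return abs(before - after)

dev = robustness_deviation(DecisionTreeClassifier(random_state=0), X_noisy, y)
print(round(dev, 3))
```

A small deviation indicates a classifier whose accuracy is insensitive to the missing values, which is the notion of robustness used in this study.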
Table 20 shows the performance of classifiers before a
missing value analysis and accuracies after the missing
values are cleaned. Based on Table 20, the accuracy of
classifiers does change after the missing values are cleaned.
Based on the average variance in the bottom line of
Table 20, the ‘Pittsburg’ dataset had the highest average
deviation value (3.77). This result is worth noting: the
most outstanding features of the ‘Pittsburg’ dataset are
that it has the fewest instances and the most target class
types, indicating a difficult classification problem.
The proportion of missing
values to the total number of instances is high compared to
the other two datasets; therefore, the effect of missing data
Table 18 Regression results/CPU time

Model                                              Unstd. coefficient    Std. coefficient  t     Sig.
                                                   B        SE           Beta
(Constant)                                          310.0    143.8                          2.2   .032
Number of principal components                     -132.8    321.0        .0               -.4    .679
Percentage of cumulative variance obtained in PCA   217    1,128.7        .1                .2    .848
Number of variables                                 -81.8    195.2        .0               -.4    .675
Number of nominal variables                          33.0    194.7        .0                .2    .865
Number of target class types                        219.1    377.0        .1                .6    .562
Number of instances                                 680.1    346.7        .2               2.0    .050
Integers are binned                                 143.5    166.3        .0                .9    .389
PCA applied                                        -164.2  1,135.2       -.1               -.1    .885
Table 19 Accuracy and CPU time correlation
Accuracy CPU time
Pearson correlation -.041
Sig. (2-tailed) .400
N 420
120 Inf Technol Manag (2013) 14:105–124
123
on the results is also higher. In other words, algorithms
classify more robustly on larger datasets.
The highest deviation in accuracy after the missing
values were cleaned was obtained for the RSES algorithm
(3.8), making it the least robust classifier in these
experiments. The algorithm least affected by missing
values was CART (0.7), which handled them best; the
second most robust was Ex-CHAID (0.9), suggesting that
the decision tree algorithms are relatively robust.
However, we did not observe any other common pattern
or behavior; more experiments should therefore be
conducted before drawing firmer conclusions about
classifier robustness on noisy datasets.
5 Discussion
The first discussion point of the study is the significant
differences between algorithm success rates. According to
the accuracy results, none of the algorithms outperformed
the others on every dataset; therefore, no single algorithm
can be expected to predict best in all domains, and data
miners should also consider dataset bias. The results of
algorithms such as C4.5 and IBk fall mostly in the ‘Very
Good’ interval; AIRS2P, MLP and Logistics are the other
‘Good’ algorithms. IBk is the only classifier with no result
in the ‘Low’ interval.
When all trials are taken into account, the means of the
algorithms differ significantly. The best classifiers out
of all the trials were IBk, C4.5, and MLP. Logistics and
LogitBoost also predicted well. However, QUEST, MLVQ,
Ex-CHAID, and CART had the lowest predictive power.
The nature of each experiment stage helped us explain
the differences between algorithm performances. For
example, in the basic implementations step, the best clas-
sification means belonged to IBk, C4.5, Logistics, and
MLP; the worst mean values belonged to MLVQ, QUEST,
and RSES. This result means that these classifiers cannot
handle continuous variables and dense dimensionality as
well as MLP, IBk, C4.5, or Logistics can. When the con-
tinuous variables were binned into intervals, a general
tendency of the performance to drop was observable;
however, CSCA, MLVQ, RSES, and SVM showed a ten-
dency to increase. RSES and SVM demonstrated a sharp
increase in performance. This finding depends on their
ability to handle discrete values better. After PCA appli-
cation, a general tendency of the performance to drop was
still observable. Thus, data analysts should be aware that
some data preprocessing attempts may reduce accuracy for
some classifiers. Although, as noted, no classifier can
outperform the others on every dataset, IBk showed the
best average performance across all datasets in these trials.
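The discretisation step discussed above bins continuous variables into intervals before training. The exact binning method depends on the tool used; as an illustrative assumption, the sketch below uses equal-width binning, a common default (the function name and bin count are ours):

```python
def equal_width_bins(values, n_bins=4):
    """Map each continuous value to an integer bin index in [0, n_bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:          # constant column: everything falls in bin 0
        return [0] * len(values)
    # clamp so the maximum value lands in the last bin rather than n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

For example, `equal_width_bins([0, 1, 2, 3, 4, 5, 6, 7], 4)` returns `[0, 0, 1, 1, 2, 2, 3, 3]`. Whether such binning helps or hurts depends on the classifier, which is consistent with the mixed results observed above.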
Another debate in this study is about success rates and
possible affecting factors. Based on the correlation analy-
sis, all dataset characteristics and PCA do affect the clas-
sifier accuracy. This finding makes sense because PCA
reduces the number of variables in the dataset.
However, discretisation has no significant effect on the
success rates; in other words, having continuous or discrete
values does not change the accuracy significantly. We
claimed that a high number of variables and instances
increases the classification difficulty and impacts the
algorithms' discriminating power.

Table 20 Robustness comparisons: accuracy before and after missing value analysis (MVA), with variation (Var.)

                  Breast cancer           Credits                 Pittsburg bridges       Avg.
Algorithm         Before  After   Var.    Before  After   Var.    Before  After   Var.    var.
AIRS2P            96.1    96.8    0.6     84.3    84.2    0.1     52.8    59.7    6.9     2.5
C4.5              94.6    96.0    1.4     86.1    85.3    0.8     63.2    67.3    4.1     2.1
CART              92.4    92.7    0.3     87.1    87.4    0.3     40.7    39.1    1.6     0.7
CSCA              96.7    96.3    0.4     65.0    65.4    0.4     31.5    33.6    2.1     1.0
Ex-CHAID          92.7    93.0    0.3     87.2    86.4    0.8     40.7    39.1    1.6     0.9
IBk               95.1    95.7    0.6     81.2    81.2    0       58.4    60.8    2.4     1.0
Logistics         96.6    96.8    0.2     85.2    86.1    0.9     55.6    60.8    5.2     2.1
LogitBoost        95.7    95.3    0.4     84.9    86.5    1.6     67.9    70.6    2.7     1.6
MLP               95.3    96.0    0.7     82.8    82.7    0.1     65      70.6    5.6     2.1
MLVQ              96.5    97.2    0.7     68.5    67.8    0.7     45.2    51      5.8     2.4
Naïve Bayesian    96.0    96.3    0.3     77.7    78.3    0.6     66.9    64.1    2.8     1.2
QUEST             91.0    91.2    0.2     83.6    84.8    1.2     40.7    39.1    1.6     1.0
RSES              94.5    96.0    1.5     38.8    38.7    0.1     33.7    43.6    9.9     3.8
SVM               66.3    68.8    2.5     54.8    54.7    0.1     53.7    53.2    0.5     1.0
Avg.                              0.72                    0.55                    3.77
The third point the paper focuses on is classifier speed.
Based on the ANOVA mean plots, the CSCA algorithm was
consistently found to be the most time-consuming; the rest
are of similar speed. This part of the study offers a clear
sense of model development times: based on these findings,
a data analyst can expect that considerably more training
time will be required with CSCA.
Based on the findings of this study, a regression model
was also built to show the overall effects of dataset and
implementation variables on the performance and speed of a
classifier. Obviously, other factors may also influence the
accuracy or speed of a classifier; thus, the set of input
variables of the regression function should be expanded in
the future. In this study, they were limited to dataset
attributes and the application of discretisation and PCA;
additional quality factors should be included in future
studies.
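As an illustration of the kind of regression meant here, the sketch below fits an ordinary least-squares line for a single predictor (e.g., number of instances versus CPU time, the one coefficient reaching significance in Table 18). The study's actual model is multivariate and was fitted with a statistics package; this one-predictor version and its function name are ours:

```python
def simple_ols(x, y):
    """Least-squares fit y ~ a + b*x; returns (intercept a, slope b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b
```

Given per-trial pairs of instance counts and measured CPU times, the fitted slope `b` estimates the additional CPU time per extra instance.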
Lastly, the robustness comparisons were conducted on
multiple datasets before and after the missing values were
cleaned. The results show that the accuracy of the
algorithms is not dramatically reduced when noise is
included; CART was the most robust classifier and RSES
the least robust, and the effect of missing data on accuracy
was higher on smaller datasets.
Another question raised in this study seeks to find an
answer to the trade-off dilemma between the success rate and
the CPU time consumed by the algorithms. In other words,
does having a longer or shorter classification time increase or
decrease accuracy? Based on the correlation studies, no
significant relationship between speed and success rate was
found; we cannot claim that algorithms that take longer
will produce higher or lower accuracy.
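The statistic behind Table 19 is a plain Pearson product-moment correlation between per-trial accuracy and CPU time. A minimal stand-alone version (illustrative, not the statistics package used in the study):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Applied to the 420 (accuracy, CPU time) pairs of the experiments, a value near zero, such as the -.041 in Table 19, indicates no linear trade-off between speed and success rate.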
In this study, the algorithms were compared to each
other based on their total success rates at prediction;
however, the evaluations discussed so far do not take into
account the cost of making wrong decisions, that is,
misclassifications. Optimizing the classification rate
without considering the cost of errors could lead to
different results.
If the costs are known, however, they can be incorporated
into a financial analysis of the decision-making process.
Even though the costs of the errors can be different,
evaluation by classification accuracy tacitly assumes equal
error costs. Taking the cost matrix into account, however,
replaces the success rate with the average cost per decision.
In this way, a classifier built without taking costs into
consideration can be used to make predictions that are
sensitive to a specific cost matrix. An alternative is to take
the cost matrix into account during the training process
and ignore costs at prediction time [63]. Because
misclassification costs can differ for each dataset,
depending on the nature of the data and the target classes
to be predicted, these factors should also be taken into
account in real-life data mining practice.
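The cost-sensitive evaluation described above, where a cost matrix replaces the success rate with the average cost per decision, can be sketched as follows (the function name and cost values are hypothetical):

```python
def average_cost(y_true, y_pred, cost_matrix):
    """Average misclassification cost per decision.

    cost_matrix[actual][predicted], with zeros on the diagonal
    so that correct decisions cost nothing.
    """
    total = sum(cost_matrix[a][p] for a, p in zip(y_true, y_pred))
    return total / len(y_true)
```

For example, with `cost = [[0, 1], [5, 0]]` a false negative (predicting class 0 when the actual class is 1) is five times as costly as a false positive, so a classifier with a slightly lower plain accuracy may still be preferable if it avoids the expensive error.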
6 Conclusion
Classification algorithms have been very frequently used
by the data mining community, and their prediction abili-
ties or complexities have been often discussed. When
implemented efficiently and correctly, data mining systems
can be crucial in many areas, such as customer relationship
management, fraud detection, credit evaluation, risk eval-
uations, medical treatment, and disease detection. There-
fore, quality criteria like accuracy and speed do play a
crucial role in data mining projects when selecting a proper
classifier.
In this study, the AIRS2P, C4.5, CART, CSCA, Ex-
CHAID, IBk, Logistics, LogitBoost, MLP, MLVQ, Naïve
Bayesian, QUEST, RSES, and SVM classification algo-
rithms were implemented on ten different datasets. The
factors affecting classification algorithm performance and
speed were underscored based on the empirical results of
the difference test, correlation, and regression studies. The
fact that dataset characteristics and implementation details
may influence the accuracy of an algorithm cannot be
denied. In summary, all dataset characteristics and PCA
applications were found to affect the success rate signifi-
cantly, but discretisation was not. Speed, the amount of
CPU time consumed by the classifier, has been another
concern: CSCA was found to be the most time-consuming
algorithm, and the number of instances in a dataset can be
claimed to increase that time significantly.
On the other hand, misclassification costs should be
applied and studied in further experiments, because dif-
ferent results can be obtained in real life applications if the
costs for errors are closely considered. Additionally, future
studies should be done on larger datasets to observe their
algorithm behaviors and scalability.
Maimon and Rokach claim that ‘‘no induction algorithm
can be best in all possible domains’’, invoking the ‘‘No
Free Lunch Theorem’’. This theorem states
that ‘‘if one inducer is better than another in some domains,
then there are necessarily other domains in which this
relationship is reversed’’ [45]. Data miners face the
dilemma of determining which classifier to use and that
situation gets harder when other criteria, such as compre-
hensibility or complexity, are also a concern. It is not an
easy task to decide which classifier to use in a data mining
problem; however, this study shows the importance of
model selection and demonstrates that no single algorithm
or data preprocessing technique is the best choice for all
datasets. As a result, the nature of the data determines
which classification algorithm will provide the best solu-
tion to a given problem. The algorithm can differ with
respect to accuracy, time to completion, and comprehen-
sibility. In practice, it will make sense to develop several
models for each algorithm, select the best model for each
algorithm, and then choose the best of those for
implementation.
The algorithms studied here include a large portion of
modern algorithms that are rarely compared side by side
in the literature. This paper
studied a broad range of well-known traditional algorithms
and recent popular algorithms with a combination of dif-
ferent implementation settings (such as discretisation,
principal component analysis) on multiple datasets. Addi-
tionally, we believe that this paper statistically demon-
strates the relationships between the success rate of the
classifiers and the dataset attributes as well as other settings
(such as discretisation and PCA). The business and aca-
demic community should take these results into consider-
ation, since establishing a knowledge discovery process on
the same algorithm with the same implementation details
may not always be either certain or efficient. The utmost
attention should be paid to model assessment and the
selection phase in an iterative manner, because any dif-
ference in dataset characteristics or preprocessing tech-
niques can affect a model’s accuracy or speed. Hence
switching to another classifier or changing the prepro-
cessing technique may be a better decision. The regression
models developed in this study also offer some hints to data
miners about the expected accuracy and speed of classifiers
based on either a given dataset or its preprocessing
characteristics.
References
1. Abbasi A, Chen H (2009) A comparison of fraud cues and
classification methods for fake escrow website detection. Inf
Technol Manage 10(2–3):83–101
2. Aha DW, Kibler D, Albert MK (1991) Instance-based learning
algorithms. Mach Learn 6(1):37–66
3. Al-Sheshtawi KA, Abdul-Kader HM, Ismail NA (2010) Artificial
immune clonal selection classification algorithms for classifying
malware and benign processes using API call sequences. Int J
Comput Sci Netw Secur 10(4):31–39
4. Armstrong LJ, Diepeveen D, Maddern R (2007) The application
of data mining techniques to characterize agricultural soil pro-
files. In: Proceedings of the 6th Australasian conference on data
mining and analytics (AusDM’07), Gold Coast, Australia,
pp 81–95
5. Badulescu LA (2007) The choice of the best attribute selection
measure in decision tree induction. Ann Univ Craiova Math
Comp Sci Ser 34(1):88–93
6. Bazan JG (1998) A comparison of dynamic and non-dynamic
rough set methods for extracting laws from decision tables.
In: Polkowski L, Skowron A (eds) Rough sets in knowledge
discovery: methodology and applications. Physica-Verlag, Hei-
delberg, pp 321–365
7. Bazan JG, Szczuka M (2001) RSES and RSESlib - A collection
of tools for rough set computations. In: Ziarko W, Yao Y (eds)
Proceedings of the 2nd international conference on rough sets and
current trends in computing (RSCTC’2000). Lecture Notes in
Artificial Intelligence 2005. Springer, Berlin, pp 106–113
8. Berson A, Smith S, Thearling K (2000) Building data mining
applications for CRM. McGraw Hill, USA
9. Biggs D, Ville B, Suen E (1991) A method of choosing multiway
partitions for classification and decision trees. J Appl Stat 18(1):
49–62
10. Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for
optimal margin classifiers. In: Proceedings of the 5th annual
workshop on computational learning theory, Pittsburgh, USA,
pp 144–152
11. Brazdil PB, Soares C, Costa JP (2003) Ranking learning algo-
rithms: using IBL and meta-learning on accuracy and time
results. Mach Learn 50(3):251–277
12. Breiman L, Friedman JH, Olshen R, Stone CJ (1984) Classifi-
cation and regression tree. Wadsworth & Brooks/Cole Advanced
Books & Software, Pacific Grove
13. Brownlee J (2005) Clonal selection theory & CLONALG: the
clonal selection classification algorithm (CSCA). Technical
Report 2-02, Centre for Intelligent Systems and Complex Pro-
cesses (CISCP), Swinburne University of Technology (SUT),
Australia. Available via http://citeseerx.ist.psu.edu/viewdoc/
download?doi=10.1.1.70.2821&rep=rep1&type=pdf. Accessed
on 15 August 2011
14. Castro LN, Zuben FJ (2002) Learning and optimization using the
clonal selection principle. IEEE Trans Evol Computat 6(3):
239–251
15. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector
machines. ACM Trans Intell Syst Technol 2(3):27:1–27:27.
Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Accessed on 29 August 2011
16. Chiarini TM, Berndt JD, Luther SL, Foulis PR, French DD
(2009) Identifying fall-related injuries: text mining the electronic
medical record. Inf Technol Manage 10(4):253–265
17. Cios KJ, Pedrycz W, Swiniarski RW, Kurgan LA (2007) Data
mining: a knowledge discovery approach. Springer, USA
18. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn
20(3):273–297
19. Dogan N, Tanrikulu Z (2010) A comparative framework for
evaluating classification algorithms. In: Proceedings of WCE
2010: international data mining and knowledge engineering 1,
London, UK, pp 309–314
20. Dunham MH (2002) Data mining: introductory and advanced
topics. Prentice Hall, New Jersey
21. EL-Manzalawy Y, Honavar V (2005) WLSVM: integrating Lib-
SVM into Weka environment. Software available at http://www.
cs.iastate.edu/~yasser/wlsvm. Accessed on 29 August 2011
22. Friedman J, Hastie T, Tibshirani R (2000) Additive logistic
regression: a statistical view of boosting. Ann Stat 28(2):337–407
23. Ge E, Nayak R, Xu Y, Li Y (2006) Data mining for lifetime
prediction of metallic components. In: Proceedings of the 5th
Australasian data mining conference (AusDM2006), Sydney,
Australia, pp 75–81
24. Hacker S, Ahn LV (2009) Matchin: eliciting user preferences
with an online game. In: Proceedings of the 27th international
conference on human factors in computing systems (CHI’09),
pp 1207–1216
25. Han J, Kamber M (2006) Data mining concepts and techniques,
2nd edn. Morgan Kaufmann, USA
26. Hand D, Mannila H, Smyth P (2001) Principles of data mining.
The MIT Press, Cambridge
27. He H, Jin H, Chen J, McAullay D, Li J, Fallon T (2006) Analysis
of breast feeding data using data mining methods. In: Proceedings
of the 5th Australasian data mining conference (AusDM2006),
Sydney, Australia, pp 47–53
28. Hergert F, Finnoff W, Zimmermann HG (1995) Improving model
selection by dynamic regularization methods. In: Petsche T,
Hanson SJ, Shavlik J (eds) Computational learning theory and
natural learning systems: selecting good models. MIT Press,
Cambridge, pp 323–343
29. Hill T, Lewicki P (2007) STATISTICS: methods and applica-
tions. StatSoft, Tulsa
30. Howley T, Madden MG, O'Connell ML, Ryder AG (2006) The effect
of principal component analysis on machine learning accuracy with
high dimensional spectral data. Knowl-Based Syst 19(5):363–370
31. Jamain A, Hand DJ (2008) Mining supervised classification per-
formance studies: a meta-analytic investigation. J Classif 25(1):
87–112
32. John GH, Langley P (1995) Estimating continuous distributions
in Bayesian classifiers. In: Besnard P, Hanks S (eds) Proceedings
of the 17th conference on uncertainty in artificial intelligence.
Morgan Kaufmann, USA, pp 338–345
33. Kaelbling LP (1994) Associative methods in reinforcement
learning: an empirical study. In: Hanson SJ, Petsche T, Kearns M,
Rivest RL (eds) Computational learning theory and natural
learning systems: intersection between theory and experiment.
MIT Press, Cambridge, pp 133–153
34. Kalousis A, Gama J, Hilario M (2004) On data and algorithms:
understanding inductive performance. Mach Learn 54(3):275–312
35. Keogh E, Kasetty S (2002) On the need for time series data
mining benchmarks: a survey and empirical demonstration. Data
Min Knowl Disc 7(4):349–371
36. Keogh E, Stefano L, Ratanamahatana CA (2004) Towards
parameter-free data mining. In: Kim W, Kohavi R, Gehrke J,
DuMouchel W (eds) Proceedings of the 10th ACM SIGKDD
international conference on knowledge discovery and data min-
ing. Seattle, Washington, pp 206–215
37. Kim SB, Han KS, Rim HC, Myaeng SH (2006) Some effective
techniques for Naive Bayes text classification. IEEE Trans Knowl
Data Eng 18(11):1457–1466
38. Ko M, Osei-Bryson KM (2008) Reexamining the impact of
information technology investment on productivity using
regression tree and multivariate adaptive regression splines
(MARS). Inf Technol Manage 9(4):285–299
39. Kohonen T (1990a) Improved versions of learning vector quan-
tization. In: Proceedings of the international joint conference on
neural networks I, San Diego, pp 545–550
40. Kohonen T (1990b) The self-organizing map. Proc IEEE 78(9):
1464–1480
41. Kohonen T, Hynninen J, Kangas J, Laaksonen J, Torkkola K
(1995) LVQ-PK: the learning vector quantization package ver-
sion 3.1. Technical Report. Helsinki University of Technology
Laboratory of Computer and Information Science, Finland.
Available via http://www.cis.hut.fi/research/lvq_pak/lvq_doc.txt.
Accessed on 27 March 2012
42. Le Cessie S, van Houwelingen JC (1992) Ridge estimators in
logistic regression. Appl Stat 41(1):191–201
43. Lim T-S, Loh W-Y, Shih Y-S (2000) A comparison of prediction
accuracy, complexity, and training time of thirty-three old and
new classification algorithms. Mach Learn 40(3):203–228
44. Loh WY, Shih YS (1997) Split selection methods for classifica-
tion trees. Stat Sin 7(4):815–840
45. Maimon O, Rokach L (2010) The data mining and knowledge
discovery handbook. Springer, Berlin, 2nd edn
46. Maindonald J (2006) Data mining methodological weaknesses
and suggested fixes. In: Proceedings of the 5th Australasian data
mining conference (AusDM2006), Sydney, Australia, pp 9–16
47. Mitchell TM (1997) Machine learning. McGraw Hill, USA
48. Pawlak Z (1982) Rough sets. Int J Parallel Prog 11(5):341–356
49. Pawlak Z (1991) Rough sets: theoretical aspects of reasoning
about data. Kluwer, Dordrecht
50. Pitt E, Nayak R (2007) The use of various data mining and
feature selection methods in the analysis of a population survey
dataset. In: Ong KL, Li W, Gao J (eds) Proceedings of the 2nd
international workshop on integrating artificial intelligence and
data mining (AIDM 2007) CRPIT 87. Gold Coast, Australia,
pp 87–97
51. Putten P, Lingjun M, Kok JN (2008) Profiling novel classification
algorithms: artificial immune system. In: Proceedings of the 7th
IEEE international conference on cybernetic intelligent systems
(CIS 2008), London, UK, pp 1–6
52. Quinlan JR (1993) C4.5: programs for machine learning. Morgan
Kaufmann, USA
53. Quinlan JR (1994) Comparing connectionist and symbolic
learning methods. In: Hanson SJ, Drastal GA, Rivest RL (eds)
Computational learning theory and natural learning systems:
constraints and prospect. MIT Press, Cambridge, pp 445–456
54. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning internal
representations by error propagation. In: Rumelhart DE,
McClelland JL, and the PDP research group (eds) Parallel distributed
processing: explorations in the microstructure of cognition 1:
foundations. MIT Press, Cambridge, pp 318–362
55. Shih YS (2004) QUEST user manual. Department of Mathe-
matics, National Chung Cheng University, Taiwan. Available via
http://www.stat.wisc.edu/~loh/treeprogs/quest/questman.pdf.
Accessed on 15 August 2011
56. SPSS (2012) CHAID and exhaustive CHAID algorithms. Available
via ftp://ftp.software.ibm.com/software/analytics/spss/support/
Stats/Docs/Statistics/Algorithms/13.0/TREE-CHAID.pdf. Acces-
sed on 9 April 2012
57. Su CT, Hsiao YH (2009) Multiclass MTS for simultaneous fea-
ture selection and classification. IEEE Trans Knowl Data Eng
21(2):192–204
58. UCI Machine Learning Repository (2010) Available via http://
archive.ics.uci.edu/ml. Accessed on 10 December 2010
59. Watkins AB (2001) AIRS: a resource limited artificial immune
classifier. M.Sc. Thesis, Mississippi State University Department
of Computer Science, USA. Available via http://www.cse.
msstate.edu/~andrew/research/publications/watkins_thesis.pdf.
Accessed on 15 August 2011
60. Watkins AB (2005) Exploiting immunological metaphors in the
development of serial, parallel and distributed learning algorithms.
PhD Thesis, University of Kent, Canterbury, UK Available via
http://www.cse.msstate.edu/~andrew/research/publications/
watkins_phd_dissertation.pdf. Accessed on 15 August 2011
61. Watkins AB, Timmis J, Boggess L (2004) Artificial immune
recognition system (AIRS): an immune-inspired supervised
learning algorithm. Genet Program Evolvable Mach 5(3):291–
317
62. WEKA (2011) Classification algorithms. Available via http://
wekaclassalgos.sourceforge.net/. Accessed on 15 August 2011
63. Witten IH, Frank E (2005) Data mining: practical machine
learning tools and techniques, 2nd edn. Morgan Kaufmann, USA
64. Wu X, Kumar V (2009) The top ten algorithms in data mining.
Chapman & Hall/CRC Press, Taylor & Francis Group, USA
65. Yang Y, Webb GI, Cerquides J, Korb KB, Boughton J, Ting KM
(2007) To select or to weigh: a comparative study of linear
combination schemes for superparent-one-dependence estima-
tors. IEEE Trans Knowl Data Eng 19(12):1652–1664
66. Zhu D, Premkumar G, Zhang X, Chu C-H (2001) Data mining for
network intrusion detection: a comparison of alternative methods.
Decis Sci 32(4):635–660