SPECIAL ISSUE
On XCSR for electronic fraud detection
Mohammad Behdad • Luigi Barone •
Tim French • Mohammed Bennamoun
Received: 7 January 2012 / Accepted: 8 May 2012 / Published online: 22 May 2012
© Springer-Verlag 2012
Abstract Fraud is a serious problem that costs the worldwide economy billions of dollars annually. However, fraud detection is difficult as perpetrators actively attempt to mask their actions among typically overwhelmingly large volumes of legitimate activity. In this paper, we investigate the fraud detection problem and examine how learning classifier systems can be applied to it. We describe the common properties of fraud, introducing an abstract problem which can be tuned to exhibit those characteristics. We report experiments on this abstract problem with a popular real-time learning classifier system algorithm; results from our experiments demonstrate that this approach can overcome the difficulties inherent to the fraud detection problem. Finally, we apply the algorithm to a real-world problem and show that it can achieve good performance in this domain.
Keywords Genetics-based machine learning · Learning classifier systems · XCSR · Fraud detection · Network intrusion detection
1 Introduction
Fraud is the general term used to describe actions undertaken by one party to obtain an unjust advantage over another. Typically, the perpetrators, using false or misleading representations, attempt to disguise their activities in an effort to maximise the effects of their fraudulent behaviour. Fraud is a serious problem. Indeed, it was estimated that the cost of personal fraud for Australians in 2007 was $980 million [1]. This research deals with the growing problem of electronic fraud (e-fraud), fraud perpetrated using electronic technologies; the most common forms include network intrusion, email spam, online credit card fraud, and telecommunications fraud.
Fraud detection is the task of attempting to detect and
recognise fraudulent activities as they occur, typically by
flagging suspicious behaviour for review and further action
by the appropriate authorities. However, fraud detection is
difficult to approach using automated methods since most
perpetrators actively try to undermine any attempt to
classify their activities as fraud. Fraud detection hence
typically corresponds to an adversarial environment.
Previous work on automated fraud detection has focused on specific domains such as spam [14] and network intrusion [21]. In this paper, we identify some general characteristics of e-fraud detection and examine the performance of learning classifier systems (LCSs) on abstract problems that exhibit these characteristics individually. In order to analyse the behaviour of an LCS on each characteristic independently, we emulate simple fraud detection problems using a parameterisable random Boolean function. We also evaluate the performance of LCSs on a real-world problem (KDD Cup 1999 intrusion detection [23]); results confirm the efficacy of this approach.
M. Behdad (&) � L. Barone � T. French � M. Bennamoun
School of Computer Science and Software Engineering,
The University of Western Australia, 35 Stirling Highway,
Crawley, WA 6009, Australia
e-mail: [email protected]
L. Barone
e-mail: [email protected]
T. French
e-mail: [email protected]
M. Bennamoun
e-mail: [email protected]
123
Evol. Intel. (2012) 5:139–150
DOI 10.1007/s12065-012-0076-5
In the next section, we discuss related work. In Sect. 3 we examine the abstract properties of fraudulent behaviour, defining characteristic properties we typically expect to observe in such domains. In Sect. 4 we present background information about LCSs and discuss how they might be used for fraud detection in dynamic environments. In Sect. 5 we present experiments that explore the performance of XCSR, a continuous LCS variant, on representative problems that capture the defining characteristic properties introduced in Sect. 3, including a real-world problem. While previous work has explored LCS behaviour on data-sets exhibiting these characteristics [6, 16], this work focuses on fraud-like environments as a whole, presenting a consistent and complete analysis of accuracy-based LCSs in these domains. Using a systematic approach and comprehensive simulation experiments, this work not only reveals the potential of XCSR for fraud detection, but also clarifies the specific difficulties of such an application. Additionally, this is the first work to decompose the concept of imbalanced environments into different types, better exploring the behaviour of XCSR on data-sets of this form. Finally, Sect. 6 concludes this work.
2 Related work
A range of different techniques have been used to tackle the problem of e-fraud detection. Artificial immune systems (AIS) are used by Oda et al. [14] for email spam detection. Just as the immune system in the body distinguishes between self and non-self, the artificial immune system in their spam detection system distinguishes between a self of legitimate email (non-spam) and a non-self of spam. A major advantage of AIS is its adaptability: spammers constantly change their methods, and AIS is good at adapting to such changes while still being able to detect spam emails.
Ant colony optimization (ACO) algorithms, which are based on the ability of ants to find the shortest paths, have been used in a few fraud detection applications. One example is the work by Feng et al. [10], which uses ACO clustering alongside dynamic self-organizing maps (DSOM) for network anomaly detection. Their method produces the normal cluster using DSOM and ACO from normal instances received during the training phase. Then, during the test phase, anomalous data clusters are identified by the normal cluster ratio. The experimental results confirm the efficiency of this approach in detecting unknown intrusions in real network connections.
LCS and its different versions have also been used for e-fraud detection. Shafi et al. [21] analyse two LCS variants, namely XCS and UCS, on intrusion detection datasets and propose some modifications to these methods in order to achieve better performance in such complex environments.
Surveys and reviews cover the different types of fraud detection and the techniques used for them. Kou et al. [11] review techniques used in credit card fraud detection, telecommunication fraud detection, and computer intrusion detection. Tsai et al. [27] review the applications of machine learning techniques to intrusion detection problems, covering 55 related studies in the period between 2000 and 2007 that focus on developing single, hybrid, and ensemble classifiers. These systems are compared by their classifier design, the datasets used, and other experimental setups. Additionally, the strengths and limitations of machine learning in developing intrusion detection systems are discussed. Financial statement fraud has been reviewed by Yue et al. [34], who discuss the data mining algorithms developed for extracting relevant knowledge from the large amounts of data encountered in such fraud fields. They also address the likelihood and effectiveness of fraud detection, feature selection, suitable algorithms for this purpose, and performance measures, presenting a framework for these analyses.
There are a few surveys and reviews on LCS. Larry Bull [4] wrote one of the earliest introductions to this topic. He introduces Holland's original LCS and explains how it works. Then, he introduces Wilson's ZCS (strength-based LCS) and XCS (accuracy-based LCS), two improved versions of LCS. At the end, he mentions some of the applications of LCSs.
Sigaud and Wilson [22] have a more recent and more comprehensive survey. They start with background material, briefly explaining the mechanisms of genetic algorithms, Markov decision processes and reinforcement learning, which are the centrepieces of LCS. Among the things discussed in the article are a brief history of LCS, its fundamental aspects (concepts such as a classifier and the action selection procedure) and three major LCS versions: ZCS, XCS and ALCS (an anticipation-based LCS).
The last review paper on LCS to mention is written by
Urbanowicz and Moore [28]. It introduces the basic LCS
concepts, the variation represented in different versions of
LCS and a comprehensive historical perspective of its
growth. The article contains a very comprehensive and
useful table of all the different versions of LCS developed
along with some key information about each, such as the
type of rule representation used and the type of problem it
is aimed at solving. It provides an excellent starting point,
even for non-computer scientists, to learn about LCS and
how to use it in a system.
3 Properties of fraudulent activity
Fraud detection is an extremely difficult task. As evidenced by the ongoing large financial losses due to fraudulent behaviour [1], no general off-the-shelf solution exists for this problem. The reason for this failure is that fraud detection must deal with some uniquely challenging properties, described below:
1. Biased classes: Perhaps one of the most defining and
important characteristics of fraudulent activity is the
bias in positive and negative experiences presented
during learning; i.e. the proportion of fraudulent
records to non-fraudulent records is heavily biased.
This "needle-in-a-haystack"-like effect typically leads to unacceptably high false positive rates in the data [17]. Indeed, these "rare" instances can have significant negative impacts on performance in data mining
problems [31] and hence any fraud detection system
must learn to overcome the problems associated with
the rarity of positive examples in its learning process.
2. Online learning: Since complete data is not available from the outset (it becomes available gradually over time), we typically have only partial information about the problem at hand [30]. This complicates the learning process, as there is a lack of sufficient data for training purposes. Indeed, the detection system may need to approximate or generalise from the limited experiences it has been presented with, but more importantly must also be able to quickly adapt as new (potentially conflicting) experiences are observed.
3. Adaptive adversaries: Another difficult aspect of fraud detection is the perpetual "arms race" that occurs when dealing with intelligent adaptive adversaries who vary their techniques over time [9]. As detection techniques mature, fraud perpetrators become increasingly sophisticated in both their methods of attack and detection evasion. This means that classifiers may be undermined by their own success; over-specialisation is a risk.
4. Concept drift: "Concept drift" means that the statistical properties of the target variable that the model is trying to predict change over time in unforeseen ways. This makes the predictions less accurate as time passes. Fraud detection suffers from a continuously changing concept definition [20]. Therefore, the approaches used for this purpose should be able to adapt themselves in the presence of drifting concepts.
5. Unequal misclassification costs: In fraud detection, misclassification costs (false positive and false negative error costs) are unequal, uncertain, can differ from example to example, and can change over time. In some domains, a false negative error is more costly than a false positive (e.g. intrusion detection), whereas in other domains (e.g. spam detection) the situation is reversed [17]. This adds to the complexity of the learning problem and to the absolute need for high accuracy.
6. Noise: Noise is the random perturbations present in
data that lead to variation in observations from the true
data signal. The presence of noise in data is almost
unavoidable, potentially introduced during any number
of stages including data collection, transmission, or
processing. Alas, the data-sets analysed and monitored
in fraud detection, partly due to their magnitude, have
a considerable amount of noise. Unfortunately, the presence of noise makes learning more difficult, typically slowing the rate of learning as the system needs to compensate for false experiences or inaccurate rewards [12].
7. Fast processing over large data sizes: In some
applications such as network intrusion detection and
email spam detection, fraud detection must be done in
near real-time. In these cases, the speed of processing
is critical, thus not affording the fraud detection
software the luxury of extended processing times [9].
4 Learning classifier systems
Many researchers have explored different machine learning and computational intelligence techniques for specific applications of fraud detection. For example, email spam detection has been examined using naïve Bayes [19], support vector machines [18], artificial immune systems [14], and evolutionary algorithms [7]. However, precious few examples exist of applying a learning classifier system (LCS) to fraud detection; the most notable being the recent works of Shafi et al. [21] and Tamee et al. [25] on using LCSs for network intrusion detection. The former analyse two LCS variants on a subset of a publicly available benchmark intrusion detection data-set, suggesting some improvements for these two algorithms. They also compare the performance of these methods with other machine learning algorithms and conclude that LCSs are a competitive approach to network intrusion detection. The work by Tamee et al. proposes using Self-Organizing Maps (SOMs) with LCS for network intrusion detection. They analyse the performance of their method and show that the proposed system is able to perform significantly better than twelve machine learning algorithms.
We believe LCSs show great promise for the fraud
detection problem as they have been successfully used in a
number of related machine learning and data mining tasks
[3] and, due to their online learning capabilities in dynamic environments [2] and straightforward knowledge representation, seem readily applicable to this kind of domain.
4.1 XCS
XCS [32] is an accuracy-based LCS in which it is not the classifiers that produce the greatest rewards that have the best chance of propagating their genes into subsequent generations, but rather the classifiers that predict the reward most accurately, independent of the magnitude of the reward. When a state is perceived by an LCS, the rule-base is scanned and any rule whose condition matches the current state is added as a member of the current "match set" M. In XCS, specifically in exploit phases, once M has been formed, the action with the highest fitness-weighted prediction is chosen. All members of M that propose the same action as the chosen rule then form the "action set" A. The accuracy of all members of the action set is then updated when the reward from the environment is determined.
Another important difference of XCS from its predecessors is that the GA is applied to A (or niches) rather than to the whole classifier population. This narrows the competition to classifiers which can be applied in similar scenarios, rather than all the classifiers in the LCS [22].
4.2 XCSR
Traditionally, XCS has been used for problems with binary strings as input. However, many real-world problems have continuous values and cannot be represented by the ternary representation used by XCS. A continuous version of XCS called XCSR was introduced by Wilson [33], which handles this kind of problem using an interval-based representation. For this work, we use the Unordered Bound Representation (UBR) [24], in which the interval predicate takes the form (pi, qi), where either pi or qi can be the maximum or minimum bound.
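The following sketch shows how a UBR interval predicate might be matched against a real-valued input. The function name and list encoding are illustrative assumptions; the key property, that either element of the pair may serve as the lower or upper bound and the order is resolved at match time, follows the description above.

```python
def ubr_matches(condition, x):
    """Unordered Bound Representation match: `condition` is a list of
    (p_i, q_i) pairs, one per attribute, where either element may be the
    lower or the upper bound. The input matches if every attribute value
    falls inside its (order-resolved) interval."""
    for (p, q), xi in zip(condition, x):
        lo, hi = min(p, q), max(p, q)  # resolve bound order at match time
        if not (lo <= xi <= hi):
            return False
    return True

# The same interval [0.2, 0.7] can be encoded either way round:
cond = [(0.7, 0.2), (0.1, 0.9)]
inside = ubr_matches(cond, [0.5, 0.5])   # both attributes in range
outside = ubr_matches(cond, [0.8, 0.5])  # 0.8 falls outside [0.2, 0.7]
```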
5 Experiments
In this section, we present experiments using an abstract problem (a random Boolean function) that can be tuned to exhibit the characteristic properties of fraudulent activity defined in Sect. 3 one at a time, and a real-world problem (the KDD Cup 1999 intrusion detection data-set) that exhibits many of these features in one composite problem. We analyse the performance of XCSR in a series of facet-wise experiments, namely in the presence of biased input, biased output, incomplete information, noise, concept drift, and adaptive adversaries. All the results reported are averages over 20 independent runs.
Since we are dealing with fraud detection, where data can be heavily biased, accuracy (the percentage of correct predictions made by the system) is not a suitable performance measure, as it may produce misleading conclusions. The reason is that if the output is biased towards one class, a high accuracy does not necessarily correlate with good performance. For instance, if in a problem 2 % of the examples belong to class A and 98 % to class B, then a single classifier which classifies everything as class B gives a high accuracy of 98 %, which is actually a poor performance considering its misclassification of all class A instances. Instead, we use a performance metric based on the product of the true positive rate (TPR = TP/(TP + FN)) and the true negative rate (TNR = TN/(TN + FP)) [29]. It should be noted that in some cases, such as network intrusion detection, TPR is more important, while in others, such as spam detection, TNR is more important. In order to have no bias towards the cost of examples, we use the product of these two values as our performance metric.
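The metric and the accuracy trap above can be checked numerically. This is a small illustrative sketch with the counts taken from the 2 %/98 % example in the text:

```python
def tpr_tnr(tp, fn, tn, fp):
    """Performance metric used here: the product of the true positive rate
    TP/(TP+FN) and the true negative rate TN/(TN+FP)."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return tpr * tnr

# The accuracy trap: 2 of 100 examples are class A (the positives), 98 are
# class B, and a degenerate classifier labels everything as B.
tp, fn = 0, 2    # all 2 positives missed
tn, fp = 98, 0   # all 98 negatives correct
accuracy = (tp + tn) / 100        # 0.98: looks excellent
metric = tpr_tnr(tp, fn, tn, fp)  # 0.0: exposes the failure on class A
```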
5.1 Experimental framework
5.1.1 Random Boolean function

Traditionally, due to its simplicity, the test problem on which LCS is examined is the multiplexer problem. However, due to its inflexibility in terms of introducing bias and concept drift, we decided to implement a new, more controllable test problem, namely the random Boolean function. A Boolean function of n Boolean variables, p1, p2, ..., pn, is a function f : B^n → B such that f(p1, p2, ..., pn) is a Boolean expression. If n = 6, a random Boolean function (RBF) is created by randomly generating a 2^n = 64 bit binary string called the action string, which represents the correct action for every possible state (6-bit binary string). By calculating the decimal value of a 6-bit binary string, the index of the correct action in the action string is found. For example, if the first 10 bits of the action string are 0110010110, then the correct action for state 000101 is the value of the 5th bit (counting from 0) in the action string, a 1 in this example.
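The construction and lookup above can be sketched directly. Function names are illustrative; only the worked example's first ten bits are fixed, and the remainder of the 64-bit action string is generated randomly as the text describes.

```python
import random

def random_boolean_function(n, rng=random.Random(0)):
    """Generate a random Boolean function of n variables as a 2**n-bit
    action string; entry i is the correct action for the state whose
    binary encoding has decimal value i."""
    return ''.join(rng.choice('01') for _ in range(2 ** n))

def correct_action(action_string, state_bits):
    """Look up the correct action for a binary state string by converting
    the state to its decimal index into the action string."""
    return int(action_string[int(state_bits, 2)])

# Worked example from the text: the action string starts 0110010110, so
# state 000101 (decimal 5) maps to the bit at index 5 of the string.
action_string = '0110010110' + random_boolean_function(6)[10:]
result = correct_action(action_string, '000101')  # index 5 holds a 1
```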
We note that RBFs are more suitable than the traditional
multiplexer benchmarks to capture properties typical of an
environment containing instances of fraud. Such properties
include output bias and complexity (the minimum number
of classifiers to represent a multiplexer grows linearly with
the size of the input string, whereas for an RBF we would
expect it to be exponential).
In order to ensure a fair comparison between the different algorithms, 20 different random Boolean functions
(20 action strings) of equal difficulty are generated at the
start of an experiment; experimental results report the
performance of each algorithm averaged across all 20
functions.
As we are dealing with real-valued inputs, a real version of the RBF problem is defined as follows: for an input vector x = (x0, ..., xn), every xi is a real value in the range 0 ≤ xi < 1. Every real-valued input vector is mapped to a binary string input using a threshold θ (0 ≤ θ < 1): if xi < θ, xi is interpreted as 0; otherwise it is interpreted as 1.
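A one-line sketch of this thresholding (the function name is illustrative):

```python
def to_binary(x, theta=0.5):
    """Map a real-valued input vector to the binary string the underlying
    RBF expects: x_i below the threshold theta reads as 0, otherwise 1."""
    return ''.join('0' if xi < theta else '1' for xi in x)

bits = to_binary([0.12, 0.80, 0.49, 0.50])  # 0.50 is not < 0.5, so it reads 1
```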
5.1.2 XCSR
The underlying accuracy-based LCS (XCS) is based on Butz's implementation of XCS [5]. All experiments except the KDD one are run with explore and exploit trials interleaved. For the representation of intervals, we use the Unordered Bound Representation (UBR) without limiting the ranges to be between 0 and 1. The values of the important XCSR parameters, unless stated otherwise, are as follows: N for RBF-4: 800 and for RBF-6: 2,000; α: 0.1; β: 0.2; ν: 5; ε0: 0.01; θGA: 25; θsub: 20.
5.1.3 KDD Cup 1999

We also evaluate the performance of XCSR on the 1999 KDD Cup data-set. Although this corpus is more than 10 years old, the difficulty of collecting labelled real-world network intrusion data, and the fact that this corpus is a benchmark data-set for the evaluation of network intrusion detection systems that has been used by many researchers (allowing results and performance to be compared), make it a reasonable choice of data-set for our purpose. The only LCS parameters with different values are θsub: 12 and N: 2,000 when learning is allowed during testing, and N: 50,000 when it is not.
5.2 Experiment 1: input bias
In this experiment, we start with a 4-bit RBF environment in which every state has the same number of bits belonging to class 0 as to class 1. We then modify this balance and gradually introduce bias to the input by making 1 bits rare. In our other experiments, an input state is generated bit by bit using the following scheme: a random number between 0 and 1 is generated, and if this number is less than 0.5, the corresponding bit is a 0, otherwise a 1. To introduce bias to the inputs, this threshold is shifted. For example, with a threshold of 0.75, if the random number is less than 0.75, the corresponding bit is a 0, otherwise a 1. This way, the probability of a bit in the input string being a 0 is 75 % and a 1 is 25 %. The numbers on the horizontal axis of Fig. 1 are calculated from this threshold as |0.5 − threshold| × 200.
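The biased generation scheme and the axis mapping can be sketched as follows; the function names and the seeded generator are illustrative assumptions.

```python
import random

def biased_state(n_bits, threshold, rng=random.Random(42)):
    """Generate an n-bit input state; each bit is 0 with probability
    `threshold`, so a threshold of 0.75 makes 1 bits rare (25%)."""
    return ''.join('0' if rng.random() < threshold else '1'
                   for _ in range(n_bits))

def axis_value(threshold):
    """Value plotted on the horizontal axis of Fig. 1:
    |0.5 - threshold| * 200, so 0.75 maps to 50."""
    return abs(0.5 - threshold) * 200

state = biased_state(4, 0.75)  # a 4-bit state with rare 1 bits
```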
Figure 1 illustrates the average performance (product of TPR and TNR) of XCSR in three different time windows: during the early stages of learning (between 5,000 and 6,000 trials), midway through (between 15,000 and 16,000 trials), and finally after the performance has stabilized (between 35,000 and 36,000 trials).
As the figure shows, input bias does affect the performance of XCSR, especially during the early trials of a run. Perhaps counter-intuitively, during the early phases of learning, the performance in environments with a higher input bias appears to be better than in environments with a lower input bias. But, as the algorithm continues learning, performance on environments with lower input bias
Fig. 1 The effect of input bias
for the RBF-4 problem
catches up and the algorithm ends up performing just as well as on the highly input-biased ones. The reason is that with a high bias in input strings, most of the states belong to a much smaller state space. Learning this space is easier and requires fewer training trials than in less biased environments.
5.3 Experiment 2: output bias
This experiment analyses the performance of XCSR for different levels of output bias. For this experiment, the 6-bit RBF problem (RBF-6) is used; the experiment varies the proportion of 1s and 0s in the action string of the random Boolean function, providing a means of varying the output bias in the experiences offered to the LCS. A 0 % biased environment is a completely balanced environment, while 100 % represents a completely biased environment. It should be noted that this form of bias is what most papers working on "class imbalance" use.
In order to investigate the effects of output bias, the output bias was systematically varied between 0 and 96 % and the performance of different LCS variants determined for each output bias value. Figure 2 summarises the results, reporting the performance for 1,000 trials after trial 50,000. Figure 3 reports the same, but over a much larger number of trials for problems with higher output bias. It should be noted that in real-world fraud problems, output bias is reported to be as large as 99.99 % [8].
As we can see from Fig. 2, in the unbiased RBF-6 environment, all XCSR variants obtain an acceptable performance (TPR·TNR) of 93 % after approximately 50,000 trials. The red broken line in Figs. 2 and 3 plots the average
Fig. 2 The effect of output bias
for the RBF-6 problem in trials
50,000 to 51,000. Figure is best
seen in colour
Fig. 3 The effect of output bias
for the RBF-6 problem in the
last 1000 trials suggested by
Orriols method (as suggested by
Orriols, the number of trials
varies depending on the level of
output bias). Figure is best seen
in colour
performance of the baseline XCSR. Note that the performance generally decreases as the output bias is varied from 0 to 96 %. Surprised by the relatively poor performance of XCSR on highly biased environments, we varied the algorithm's control parameters in order to "tune" the algorithm for the problem at hand. We found that by raising the subsumption threshold (θsub: 60) and making it more difficult for a classifier to be judged as perfectly accurate (ε0: 0.001), the performance of XCSR was greatly improved. It should be mentioned that Orriols-Puig et al. [16] also suggest increasing θsub in the presence of class imbalance.
We also observed that disabling (both forms of) subsumption improves the performance substantially in the biased environments. The black solid line decorated with triangles in Figs. 2 and 3 plots the performance of this variant. We observe from Fig. 3 that by tuning and disabling subsumption, the performance of XCSR substantially surpasses that of the baseline XCSR, obtaining an acceptable performance for output-biased environments. By removing subsumption, the number of specialised classifiers in the population increases, allowing a relatively larger knowledge base for decision making and giving the LCS the power to discriminate between states at a finer resolution. However, this increase in the number of specialised classifiers decreases the generalization capability of the algorithm and is ultimately limited by the maximum size of the classifier set in the LCS. For many problems, this fine-scale specialisation may simply not be feasible.
Also plotted in Figs. 2 and 3 are two variants of XCSR specifically modified to handle output-biased environments. The first, by Nguyen et al. [13], modifies three baseline XCSR control parameters: two are set to different constant values (α: 0.05, ν: 10) and the third (ε0) is calculated based on the inverse of the output bias level. Its performance is plotted using a brown solid line decorated with circles. The second is an XCSR variant based on the recommendations of Orriols-Puig and Bernado-Mansilla [15]; its performance is plotted using a blue solid line. This XCSR variant uses tournament selection in the discovery component of the LCS and GA subsumption (but no action set subsumption), with the minimum GA experience (θGA) set to be inversely proportional to the output bias of the environment. In essence, this variant "protects" classifiers from being subsumed or replaced without adequate experience in biased environments, ensuring the GA occurs less frequently in these situations.
The recommendations of Orriols-Puig and Bernado-Mansilla also include running XCSR for a notably larger number of trials than those reported in Fig. 2, with the number of trials increasing proportionally with the output bias of the environment. Following their recommendations, Fig. 3 plots the performance of each algorithm variant after that many trials.
In comparison with Fig. 2, we see in Fig. 3 that the performance of all XCSR variants, especially in highly biased environments, improves markedly with the increased number of learning trials. However, the number of trials suggested by Orriols-Puig and Bernado-Mansilla is very large, with over 2.2 million trials needed at the 96 % output bias value. Such large training times are likely to be infeasible, limiting the approach's applicability.
Figure 4 plots the average performance of the four LCS variants when we continue learning up to the maximum number suggested by Orriols-Puig and Bernado-Mansilla (2,205,000 trials for 96 % output skewness), reporting performance during the last 1,000 trials of the run. As the figure shows, most of the algorithms learn a perfect answer to the problem, except when the output bias is very high, but performance is still acceptable even in these cases. This is not surprising given the large number of training trials.

Overall, the experiments reported in this section show that the baseline XCSR algorithm does not perform well on environments with highly biased outputs. In these cases, the algorithm needs to be tuned to the problem at hand and be "trained" for much longer periods of time.
5.4 Experiment 3: adaptability in incomplete domains
This experiment evaluates the performance of (baseline) XCSR on a 4-bit RBF problem in the presence of incomplete information, simulating the requirement for online learning in a fraud detection system. To achieve this, a part of the environment (a percentage of the input states) is kept hidden, and only after some predetermined time (the midway point of the experiment) are these hidden states made visible; the hidden states are representative of some fraudulent activity occurring within the environment that is not present at the beginning of the "monitoring". That is, the algorithm is not exposed to these hidden states (and hence cannot directly learn from these experiences) until the midway point of the experiment, after which time they are added to the pool of possible inputs, gradually being exposed to the LCS.
In this experiment, we hide different percentages of the possible states of the 4-bit RBF problem from the learning algorithm for 25,000 trials, reveal them at the 25,000-trial point (allowing the system to use them), and then continue the experiment for another 25,000 trials. Since we are dealing with continuous-valued inputs, the hiding is applied to the ranges of each feature of the state. For instance, when the problem is the 4-bit version of the RBF, each of the four features of every state is a real value between 0 and 1. For an incompleteness level of 10 %, ten percent of every
feature's range, determined randomly at the beginning of each experiment, is hidden.
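The hiding scheme can be sketched as follows. The helper names are hypothetical, and treating a state as hidden when any of its feature values falls inside that feature's hidden sub-range is our reading of the description above.

```python
import random

def hidden_ranges(n_features, level, rng=random.Random(7)):
    """For each feature, pick a random sub-interval of [0, 1) of width
    `level` (e.g. 0.1 for 10% incompleteness) to withhold during the
    first phase of the experiment."""
    ranges = []
    for _ in range(n_features):
        lo = rng.random() * (1.0 - level)  # keep the interval inside [0, 1)
        ranges.append((lo, lo + level))
    return ranges

def is_hidden(state, ranges):
    """A state is withheld from the learner if any of its feature values
    falls inside that feature's hidden range."""
    return any(lo <= xi < hi for xi, (lo, hi) in zip(state, ranges))

ranges = hidden_ranges(4, 0.1)  # 10% incompleteness on a 4-feature state
```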
Figure 5 reports the results of this experiment at three different time frames before revealing the hidden states. Figure 6 similarly reports the results in three time frames after the revelation of the hidden states. Figure 5 shows that before the revelation of the hidden states, XCSR is able to obtain near-perfect performance on highly incomplete environments due to the small size of the learning space. Figure 6 shows that after the revelation of the hidden states, performance sharply declines shortly after their inclusion. However, by continuing the learning, the LCS can improve its understanding of the environment (contrast the solid blue line in Fig. 6, which represents the average performance between trials 25,000 and 26,000, with the green solid line decorated with squares, which represents the average performance between trials 49,000 and 50,000). Another observation evident from Fig. 6 is that up to 30 % incompleteness does not have a serious adverse effect on the performance after revelation, as long as the algorithm has enough time to learn (about 40,000 trials in the case of RBF-4). These results confirm that XCSR is indeed capable of adaptively responding to changes in the environment.
5.5 Experiment 4: noise

This experiment tests the effect of noise on the performance of (baseline) XCSR by introducing noise at levels from 0 to 50 % into the 4-bit RBF problem. Here, noise corresponds to the environment erroneously inverting the reward for a chosen action, i.e. returning 0 instead of 1,000 (the maximum payoff) and 1,000 instead of 0. Consequently, 50 % noise corresponds to completely random feedback.
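This noise model amounts to a small wrapper around the environment's reward signal. A sketch, with hypothetical names:

```python
import random

MAX_PAYOFF = 1000  # maximum reward used throughout the RBF experiments

def noisy_reward(true_reward, noise_level, rng):
    """With probability `noise_level`, invert the reward (0 <-> 1,000)."""
    if rng.random() < noise_level:
        return MAX_PAYOFF - true_reward
    return true_reward

rng = random.Random(3)
# At 50 % noise the feedback carries no information about the true class.
rewards = [noisy_reward(1000, 0.5, rng) for _ in range(10)]
print(rewards)
```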
Fig. 4 The effect of output bias for the RBF-6 problem in trials 2,204,000 to 2,205,000. Figure is best seen in colour

Fig. 5 The effect of hidden states for the RBF-4 problem before revelation of hidden states

Figure 7 shows the performance of XCSR in the presence of different levels of noise in three periods: between 5,000 and 6,000 trials, between 15,000 and 16,000 trials, and between 35,000 and 36,000 trials. The results show that
XCSR is tolerant to small levels of noise; performance only starts to degrade considerably for noise levels of 25 % and higher. It should be mentioned that a study on the effect of noise on UCS (a supervised variant of XCS) was conducted by Dam [6], in which UCS was shown to be robust to similar levels of noise.
5.6 Experiment 5: concept drift

Hai Dam [6] has studied the effect of concept drift on UCS in stream data mining and concludes that, provided UCS has a large enough population, it can adapt to changes effectively. Here we analyse the effect of concept drift on XCSR. Concept drift is simulated as follows: (baseline) XCSR tries to learn a 6-bit RBF problem, while every 500 trials a percentage of the states switch class. Figure 8 reports the performance of XCSR in such an environment in three different learning periods: between trials 15,000 and 16,000 (the early stages of learning), between 35,000 and 36,000, and between 98,000 and 99,000 (where performance has almost converged). As Fig. 8 shows, when the concept changes, XCSR requires a much higher number of trials to learn the problem. We also examined Dam's [6] finding by increasing the population size to see its effect on the performance of XCSR, but the effect was nominal: for the RBF-6 problem, population sizes larger than 2000 did not yield any significant improvement.
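The drift schedule described above can be sketched as follows. This is a minimal illustration with hypothetical names; the paper does not specify how the switching states are chosen, so random selection is an assumption.

```python
import random

def apply_concept_drift(labels, fraction, rng):
    """Flip the class of a random `fraction` of states (binary 0/1 labels)."""
    n_flip = int(len(labels) * fraction)
    drifted = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        drifted[i] = 1 - drifted[i]
    return drifted

rng = random.Random(7)
labels = [0, 1] * 32  # toy label table standing in for the problem's states
for trial in range(1, 2001):
    if trial % 500 == 0:       # every 500 trials the concept drifts
        labels = apply_concept_drift(labels, 0.10, rng)
```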
5.7 Experiment 6: targeted bias drift
(adaptive adversaries)
In this experiment, (baseline) XCSR is dealing with an
adversarial environment. The environment tries to change
the output bias in a way that XCSR performs more poorly.
In other words, when the environment observes that XCSR
is misclassifying examples belonging to class 0, it sends examples of that class more often. Figure 9 shows the results.

Fig. 6 The effect of hidden states for the RBF-4 problem after revelation of hidden states

Fig. 7 The effect of noise for the RBF-4 problem in three different stages of learning
When dealing with adaptive adversaries of the kind defined in this experiment, Fig. 9 shows that the performance of XCSR initially deteriorates, but the system then learns the environment even faster. The reason for this behaviour is that when XCSR misclassifies an example of a certain class, it subsequently sees more examples of that class, allowing it to quickly update its classifiers and classify examples of that class better. Therefore, such an adaptive adversary is not a hindrance for XCSR, but rather a boon.
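The adversary's behaviour can be sketched as a simple error-driven sampler. This is a hypothetical illustration: the paper does not give the environment's exact update rule, so the proportional weighting by observed error counts is our assumption.

```python
import random

class AdaptiveAdversary:
    """Biases sampling toward the class the learner misclassifies most."""

    def __init__(self, n_classes, rng):
        self.errors = [1.0] * n_classes  # smoothed per-class error counts
        self.rng = rng

    def record(self, true_class, predicted_class):
        """Observe one classification outcome."""
        if predicted_class != true_class:
            self.errors[true_class] += 1.0

    def next_class(self):
        """Roulette-wheel selection proportional to observed errors."""
        total = sum(self.errors)
        r = self.rng.random() * total
        for c, e in enumerate(self.errors):
            r -= e
            if r <= 0:
                return c
        return len(self.errors) - 1

adversary = AdaptiveAdversary(n_classes=2, rng=random.Random(9))
adversary.record(true_class=0, predicted_class=1)  # learner got class 0 wrong
```

As the experiment observed, this feedback loop is self-defeating for the adversary: the more it over-samples a class, the more training examples of that class the learner receives.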
5.8 Experiment 7: real-world data
In this experiment, we test our implementation on the
KDD Cup 1999 intrusion detection data-set [23]. This
data-set contains 4,898,430 experiences (connections) in
the training file and 311,029 in the test file. A smaller
version of the training file, which contains a 10 % subset of the training data with the same class distribution as the complete file, was used in this work to reduce overall running times. In each file, every "connection" has a vector of 41 features and a label, which determines whether the connection is normal or an attack. There are 22 types of
attack in the training file and 39 in the testing file. Every
attack type belongs to one of four attack categories.
Performance is measured using a cost per example value
which is based on a cost matrix (note however, as per the
contest rules, the cost matrix is hidden from the learning
system during training). Although it is possible (and
probably beneficial) to train XCSR using multiple passes
of the training data, we only use one pass in order to
evaluate the algorithm as an online learner.
Fig. 8 The effect of concept drift for RBF-6

Fig. 9 The effect of targeted bias drift for the RBF-6 problem
Two metrics are used to evaluate performance. The first is accuracy, the percentage of correct predictions made by the system, calculated from the true positive, true negative, false positive, and false negative counts: (TP + TN) / (TP + FN + FP + TN). The second metric is Cost Per Example (CPE), calculated using the following formula [26]:

CPE = (1/N) Σ_{i=1}^{m} Σ_{j=1}^{m} CM(i, j) × C(i, j)

where CM is the confusion matrix (the number of instances of class i classified as class j), C is the cost matrix (the cost of misclassifying class i as class j), N is the total number of test instances, and m is the number of classes. A smaller CPE means a better classification.
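Both metrics are straightforward to compute from the confusion and cost matrices; a minimal sketch (function names are ours):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def cost_per_example(confusion, cost, n_examples):
    """CPE = (1/N) * sum_{i,j} CM(i, j) * C(i, j); lower is better."""
    total = 0.0
    for cm_row, cost_row in zip(confusion, cost):
        for cm_ij, c_ij in zip(cm_row, cost_row):
            total += cm_ij * c_ij
    return total / n_examples

# Toy 2-class example: 10 class-0 instances misclassified at cost 2 each.
cm = [[50, 10],
      [0, 40]]
c = [[0, 2],
     [1, 0]]
print(cost_per_example(cm, c, 100))          # 0.2
print(accuracy(tp=50, tn=40, fp=10, fn=0))   # 0.9
```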
In our implementation, after receiving a connection, the system normalizes its feature values to the range between -1 and 1. During the training phase, explore mode is used, in which there is a 50 % chance of selecting the class randomly (for exploration purposes). During the test phase, explore and exploit are interleaved, but during explore phases, all classes are selected deterministically: the class with the highest fitness-weighted prediction is chosen. We also ran experiments in which learning is disabled during the test phase (and only exploit is used). As expected, this decreases overall performance.
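The normalization step can be sketched as a linear rescaling, under the assumption that per-feature minima and maxima are taken from the training corpus (the paper does not state where the bounds come from):

```python
def normalise(values, mins, maxs):
    """Linearly map each feature from [min, max] to [-1, 1]."""
    out = []
    for x, lo, hi in zip(values, mins, maxs):
        if hi == lo:               # constant feature: map to 0
            out.append(0.0)
        else:
            out.append(2.0 * (x - lo) / (hi - lo) - 1.0)
    return out

print(normalise([5.0, 0.0, 10.0], [0.0] * 3, [10.0] * 3))  # [0.0, -1.0, 1.0]
```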
As Table 1 shows, and as expected, XCSR performs best when learning is allowed during the test phase (feedback is provided). When learning is not allowed during testing, the performance of XCSR is still reasonable: within 1 % of the KDD Cup 1999 winner. Note that while XCSR has seemingly not learned as accurate a model as the KDD Cup 1999 winner, XCSR was not tuned for this problem at all.
It should be noted that this real-world problem has most
of the common properties of fraud discussed in Sect. 3:
• Input bias. Most fields of the corpus have rare inputs. For example, the su_attempted field in the test file contains 311,024 connections with a value of 0, only 1 connection with a value of 3, and 2 connections with a value of 2. The training file has a similar distribution.
• Output bias. For instance, in the training file, 3,884,410 connections belong to category 2 and only 52 connections belong to category 3.
• Adaptive adversaries and online learning. There are 14
new attack types in the test file which do not occur in
the training file.
• Concept drift. One attack type in the training file
changes category in the test file.
• Unequal misclassification costs. In the cost matrix, the
cost of misclassifying some attacks is higher than
others.
6 Conclusions and future work
Fraud detection is a challenging problem due to the
adversarial nature of perpetrators and the rarity of fraud-
ulent events among the magnitude of legitimate activity.
In this research, we detailed seven general characteristics
of fraud and motivated the use of LCSs for learning in
such environments. Through the use of abstract problems
which enable systematic analysis of these characteristics,
we examined the behaviour of XCSR in the presence of
these challenges. Table 2 shows a summary of the com-
mon properties of fraud and the matching experiments in
this work. Future work will explore test problems that examine the features of Table 2 in more detail; investigate batch-style feedback, the kind one would expect in real-world fraud domains; speed up the XCSR process by reducing population size; examine the complexity of functions as well as their bias; and examine the connection between the dimensionality of the problem (number of features) and population size.
Acknowledgments The first author would like to acknowledge the
financial support provided by the Robert and Maude Gledden
Scholarship.
Table 1 Cup winner and XCSR performances on the KDD Cup 1999 data-set

Algorithm                        CPE     Accuracy (%)
KDD Cup 1999 winner              0.233   92.71
LCS, without on-line learning    0.266   91.56
LCS, with on-line learning       0.092   96.48
Table 2 Experiments summary

Properties                              Exp         LCS performance
Biased classes                          1, 2, 3, 8  Requires tuning and more trials, but in general, overcomes these challenges
Online learning                         4, 8        Very responsive and quick to adapt
Adaptive adversaries                    7, 8        Inherently an online-learner
Concept drift                           6, 8        Inherently an online-learner
Unequal misclassification costs         8           Requires incorporating cost into rule updating procedure
Noise                                   5           Sufficiently robust
Fast processing over large data sizes   8           Responsive because of online learning (although larger numbers of dimensions require larger population sizes)
References
1. Australian Bureau of Statistics (2008) Personal fraud, 2007. Tech.
Rep. 4528.0, Australian Bureau of Statistics
2. Bacardit J, Bernado-Mansilla E, Butz MV (2008) Learning
classifier systems: looking back and glimpsing ahead. In: 10th
international workshop on learning classifier systems, Springer,
Berlin, pp 1–21
3. Bull L (2004) Applications of learning classifier systems.
Springer, Berlin
4. Bull L (2004) Learning classifier systems: a brief introduction. In: Applications of learning classifier systems. Springer, Berlin, pp 3–14
5. Butz MV, Wilson SW (2000) An algorithmic description of XCS.
Tech. Rep. 2000017, Illinois Genetic Algorithms Laboratory
6. Dam HH (2008) A scalable evolutionary learning classifier sys-
tem for knowledge discovery in stream data mining. PhD thesis,
University of New South Wales—Australian Defence Force
Academy
7. Dudley J, Barone L, While L (2008) Multi-objective spam fil-
tering using an evolutionary algorithm. In: Congress on evolu-
tionary computation, pp 123–130
8. Duman E, Ozcelik MH (2011) Detecting credit card fraud by genetic algorithm and scatter search. Expert Syst Appl 38(10):13057–13063
9. Fawcett T, Haimowitz I, Provost F, Stolfo S (1998) AI approaches to fraud detection and risk management. AI Mag 19(2):107–108
10. Feng Y, Zhong J, Xiong ZY, Ye CX, Wu KG (2007) Network anomaly detection based on DSOM and ACO clustering. In: Advances in neural networks-ISNN 2007, Lecture notes in computer science, vol 4492. Springer, Berlin, pp 947–955
11. Kou Y, Lu CT, Sirwongwattana S, Huang YP (2004) Survey of
fraud detection techniques. In: Networking, sensing and control,
2004 IEEE international conference on, vol 2, pp 749–754
12. Nettleton DF, Orriols-Puig A, Fornells A (2010) A study of the
effect of different types of noise on the precision of supervised
learning techniques. Artif Intell Rev 33(4):275–306
13. Nguyen TH, Foitong S, Srinil P, Pinngern O (2008) Towards
adapting xcs for imbalance problems. In: PRICAI ’08, Springer,
Berlin, pp 1028–1033
14. Oda T, White T (2005) Immunity from spam: an analysis of an
artificial immune system. In: 4th International conference on
artificial immune systems, Springer, Berlin, pp 276–289
15. Orriols-Puig A, Bernado-Mansilla E (2008) Evolutionary rule-
based systems for imbalanced data sets. Soft Comput 13(3):
213–225
16. Orriols-Puig A, Bernado-Mansilla E, Goldberg DE, Sastry K, Lanzi PL (2009) Facetwise analysis of XCS for problems with class imbalances. IEEE Trans Evol Comput 13(5):1093–1119
17. Phua C, Lee V, Smith-Miles K, Gayler R (2005) A comprehen-
sive survey of data mining-based fraud detection research. Tech.
rep., Monash University
18. Sculley D, Wachman GM (2007) Relaxed online SVMs for spam
filtering. In: 30th annual international ACM SIGIR conference on
research and development in information retrieval, ACM Pub-
lishers, pp 415–422
19. Seewald AK (2007) An evaluation of naïve Bayes variants in content-based learning for spam filtering. Intell Data Anal 11(5):497–524
20. Shafi K, Abbass HA (2007) Biologically-inspired complex
adaptive systems approaches to network intrusion detection.
Inform Secur Tech Rep 12(4):209–217
21. Shafi K, Kovacs T, Abbass HA, Zhu W (2009) Intrusion detection
with evolutionary learning classifier systems. Nat Comput
8(1):3–27
22. Sigaud O, Wilson SW (2007) Learning classifier systems: a survey. Soft Comput 11(11):1065–1078
23. Stolfo S, et al (1999) KDD cup 1999 dataset. UCI KDD reposi-
tory. http://kdd.ics.uci.edu
24. Stone C, Bull L (2003) For real! XCS with continuous-valued
inputs. Evol Comput 11(3):299–336
25. Tamee K, Rojanavasu P, Udomthanapong S, Pinngern O (2008)
Using self-organizing maps with learning classifier system for
intrusion detection. In: PRICAI ’08, Springer, Berlin, pp 1071–
1076
26. Toosi AN, Kahani M (2007) A new approach to intrusion
detection based on an evolutionary soft computing model using
neuro-fuzzy classifiers. Comput Commun 30(10):2201–2212
27. Tsai CF, Hsu YF, Lin CY, Lin WY (2009) Intrusion detection by machine learning: a review. Expert Syst Appl 36(10):11994–12000
28. Urbanowicz RJ, Moore JH (2009) Learning classifier systems: a complete introduction, review, and roadmap. J Artif Evol Appl 2009:1–25
29. Vanderlooy S, Sprinkhuizen-Kuyper I, Smirnov E (2006) Reli-
able classifiers in ROC space. In: 15th BENELEARN machine
learning conference, pp 27–36
30. Vatsa V, Sural S, Majumdar A (2005) A game-theoretic approach
to credit card fraud detection. Lect Notes Comput Sci 3803:
263–276
31. Weiss GM (2004) Mining with rarity: a unifying framework.
ACM SIGKDD Explor Newslett 6(1):7–19
32. Wilson SW (1995) Classifier fitness based on accuracy. Evol
Comput 3(2):149–175
33. Wilson SW (2000) Get real! XCS with continuous-valued inputs.
In: Learning classifier systems, from foundations to applications,
Springer, Berlin pp 209–222
34. Yue D, Wu X, Wang Y, Li Y, Chu CH (2007) A review of data
mining-based financial fraud detection research. In: Wireless
communications, networking and mobile computing, pp 5519–
5522