Cerebellar Model Classifier for Data Mining With Linear Time Complexity


International Journal of Computational Intelligence and Applications, Vol. 6, No. 3 (2006) 299–313. © Imperial College Press.

    A CEREBELLAR MODEL CLASSIFIER FOR DATA MINING

    WITH LINEAR TIME COMPLEXITY

    DAVID CORNFORTH

    School of Information Technology and Electrical Engineering

University of New South Wales, Australian Defence Force Academy

    Northcott Drive, Canberra, ACT 2600, Australia

Received 14 October 2003; Revised 24 July 2006

    Accepted 7 August 2006

Techniques for automated classification need to be efficient when applied to large datasets. Machine learning techniques such as neural networks have been successfully applied to this class of problem, but training times can blow out as the size of the database increases. Some of the desirable features of classification algorithms for large databases are linear time complexity, training with only a single pass of the data, and accountability for class assignment decisions. A new training algorithm for classifiers based on the Cerebellar Model Articulation Controller (CMAC) possesses these features. An empirical investigation of this algorithm has found it to be superior to the traditional CMAC training algorithm, both in accuracy and time required to learn mappings between input vectors and class labels.

    Keywords: Cerebellar model articulation controller; classification; training.

    1. Introduction

    A well-studied class of machine learning problems is that of categorization, or clas-

    sification. Here, the key is to determine some relationship between a set of input

    vectors that represent stimuli, and a corresponding set of values on a nominal scale

    that represent category or class. The relationship is obtained by applying an algo-

rithm to training samples that are 2-tuples (u, c), each consisting of an input vector u

and a class label c. The learned relationship can then be applied to instances of u

not included in the training set, in order to discover the corresponding class label c.1

A number of machine learning techniques, including genetic algorithms2 and neural

networks,3 have been shown to be very effective in solving such problems.

There are many large databases in existence that could yield valuable informa-

tion if efficient and scalable methods of automated classification could be found.4


    Many algorithms for automated classification have an inherently non-linear rela-

    tionship between time taken by the algorithm to run and the number of train-

ing examples. Analysis methods that work well for small data sets are completely impractical when applied to larger data sets. For example, training of a neural

    network using back-propagation is known to be NP-complete.5 Some studies sug-

    gest that evolutionary algorithms have polynomial time complexity.6 The work

    presented here investigates classification algorithms based on the Cerebellar Model

    Articulation Controller (CMAC),7 which have linear time complexity.

    Global error minimization techniques, such as back-propagation, require mul-

    tiple traversals of the data set during training. If the training set is very large,

    it cannot fit inside the memory of the machine. This will result in multiple disk

    read/write operations, which are relatively costly in time and can contribute greatly

    to data processing time. Current approaches include compression or summary of

    the data set before processing, and redesign of analysis tools so that analysis can be

    completed with only one pass of the data. This paper shows how the original CMAC

    training algorithm, which normally uses an iterative global error minimization tech-

    nique, may be adapted so that the training set only needs to be accessed once.

    The usefulness of a classification algorithm may be enhanced by providing an

    explanation for each class assignment decision. This could take the form of a set of

    rules that contribute to the assignment, or a probability for each class, given the

    input. Black box methods such as neural networks do not naturally lend themselves

    to this form of analysis. The new algorithm described here provides accountability

    for class assignment decisions in the form of class probabilities.

    In this paper, I propose the Kernel Addition Training Algorithm (KATA) as

    a more effective learning algorithm for the CMAC when used as a classifier. The

    proposed method requires only a single pass of the data and provides a probability

    model for class assignment decisions.

    The organization of the remainder of this paper is as follows. Section 2 briefly

reviews the architecture of the CMAC, and introduces the proposed modifications. Section 3 provides an empirical investigation of the new fast learning algorithm and

    the traditional error minimization methods. Section 4 provides a discussion of the

    results and implications arising from them.

    2. Cerebellar Model Articulation Controller

    The CMAC, or Albus perceptron, is a sparse coarse-coded associative memory

    algorithm that mimics the functionality of the mammalian cerebellum.8 Originally,

the CMAC was proposed as a function modeler for robotic controllers,7 but has

been extensively used in reinforcement learning9,10 and also as a classifier.11–14

The training method used by Albus is an iterative algorithm based on global error minimization, and


    requires multiple passes of the training data. The method proposed in this paper

    requires only a single pass of the data. Furthermore, it provides a probability model

for class assignment decisions.

The CMAC is able to accept real-valued inputs. An input vector u with d

    components may be visualized as a point in d-dimensional space. The input space

    is quantized using a set of q overlapping tiles as shown in Fig. 1(a), where q = 2.

    For input spaces of high dimensionality, the tiles form hyper-rectangular regions.



    A query is performed by first activating all the tiles that contain a query point.

    The activated tiles address memory cells, which contain stored values. These are

the weights of the system, as shown in Fig. 1(b). The summing of these values produces an overall output. The CMAC output is therefore stored in a distributed

    fashion, such that the output corresponding to any point in input space is derived

    from the value stored in a number of memory cells.

    A change of the input vector results in a change in the set of activated tiles, and

    therefore a change in the set of memory cells participating in the CMAC output.

The memory size required by the CMAC depends on the number of tilings

    and the size of tiles. If the tiles are large, such that each tile covers a large proportion

    of the input space, a coarse division of input space is achieved, but local phenomena

    have a wide area of influence. If the tiles are small, a fine division of input space is

    achieved and local phenomena have a small area of influence. The number of tiles

    in the input space, and therefore the number of memory cells, is usually sufficiently

    large to become prohibitive due to memory constraints. Many of these tiles are

    never used due to the sparse coverage of the input space. One solution is to employ

    a consistent random hash function to collapse the large tiling space into a smaller

memory cell space.16 This reduces the memory use, but the memory requirements

for a classifier remain relatively large. An alternative and more comprehensive

    solution is the hierarchical CMAC.17 Here, several low-dimensional CMACs are

    connected to form a multi-layer tree structure. Training is accomplished by min-

    imizing the output error, and back-propagating errors to hidden layers. The tree

    structure can also be pruned to reduce redundant nodes.18 This method cannot be

    employed here because it is not compatible with the training rule presented.
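As an illustration of the hashing approach, the sketch below (Python) collapses the large space of tile addresses into a fixed-size weight table with a consistent hash; the table size and the CRC32 hash are illustrative assumptions, not the configuration used in the paper.

    import zlib

    MEMORY_CELLS = 2 ** 16  # illustrative table size, not the paper's setting

    def hash_tile(layer, tile_coords):
        """Map a (tiling layer, tile coordinates) pair to a memory cell index.

        A consistent hash guarantees that the same tile always addresses the
        same cell, so colliding tiles simply share a weight.
        """
        key = (str(layer) + ":" + ",".join(map(str, tile_coords))).encode()
        return zlib.crc32(key) % MEMORY_CELLS

    # example: the active tile at coordinates (3, 7) in tiling layer 1
    print(hash_tile(1, (3, 7)))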

The CMAC learns a mapping from input space U ⊆ R^d to output space Z ⊆ R,

    where d is the number of dimensions, or the size of the input vector. Following

    existing convention, this can be broken into three mappings12:

The input space to multi-layer tiling system mapping E : u → x.

The multi-layer tiling system to memory table mapping H : x → y.

The memory table to output mapping (weighted summation) W : y → z.

    The mapping E can be implemented using simple integer division in each dimen-

    sion. The integer values for each dimension are combined to form one address for

    each tiling layer. Addresses for the other tiling layers are calculated in a similar

    way. The mapping H receives q addresses that must be mapped to memory cells.

    This mapping is usually implemented by a hashing function. The mapping W is a

    weighted summation of the contents of the memory cells. These values are set dur-

ing training. An improvement over the Albus CMAC is the widely adopted practice of embedding kernel functions into the quantizing regions.19–21 This modifies the output so that each stored weight is scaled by its kernel value and the sum is normalized by the total kernel activation:

z = \frac{\sum_{i=1}^{q} k(dist_i) \, w[addr_i]}{\sum_{i=1}^{q} k(dist_i)}    (1)
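To make the three mappings concrete, the sketch below is one possible realization of a query: integer division per tiling layer (mapping E), a sparse dictionary standing in for the hashed memory table (mapping H), and the kernel-weighted, normalized sum of Eq. (1) (mapping W). The number of layers, tile width, layer offsets and linear kernel are illustrative assumptions rather than the settings used in the experiments.

    from collections import defaultdict

    Q = 4          # number of tiling layers (illustrative)
    RES = 0.25     # tile width in each dimension (illustrative)

    weights = defaultdict(float)   # sparse memory table, keyed by (layer, tile coordinates)

    def tile(u, layer):
        """Mapping E: quantize the input for one tiling layer by integer division;
        each layer is offset by a fraction of the tile width so that layers overlap."""
        offset = layer * RES / Q
        return tuple(int((x + offset) // RES) for x in u)

    def kernel(dist):
        """Linear kernel value for a normalized distance in [0, 1]."""
        return max(0.0, 1.0 - dist)

    def distance(u, layer, coords):
        """Normalized distance of the query point from the centre of the active tile."""
        offset = layer * RES / Q
        centre = [(c + 0.5) * RES - offset for c in coords]
        return max(abs(x - c) / (RES / 2) for x, c in zip(u, centre))

    def query(u):
        """Mapping W: kernel-weighted, normalized sum over the q active cells (Eq. (1))."""
        num = den = 0.0
        for layer in range(Q):
            coords = tile(u, layer)
            k = kernel(distance(u, layer, coords))
            num += k * weights[(layer, coords)]
            den += k
        return num / den if den > 0.0 else 0.0

    print(query([0.3, 0.7]))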


Fig. 2. How kernel functions may be embedded into a 2-dimensional tiling grid. (a) Step kernel function. (b) Linear kernel function.

    Each weight y is indexed by address a, and the kernel function k is applied to

    some distance measure of the query point from the centre of the tile. The number

    of tiling layers is q. Some common kernel functions are illustrated in Fig. 2.
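A minimal sketch of the two kernels of Fig. 2, assuming the distance has already been normalized so that the tile edge corresponds to 1:

    def step_kernel(dist):
        """Step kernel: every point inside the tile contributes equally."""
        return 1.0 if dist <= 1.0 else 0.0

    def linear_kernel(dist):
        """Linear kernel: the contribution decays linearly towards the tile edge."""
        return max(0.0, 1.0 - dist)

    for d in (0.0, 0.5, 1.0):
        print(d, step_kernel(d), linear_kernel(d))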

    2.1. Output mapping

    The CMAC may be used as a classifier by adopting a suitable mapping between the

    real valued output variable z and the nominal variable class label c. One possible

    mapping8 interprets positive values of z as one class, and negative values of z as

another class. This is sufficient for two-class problems, and is the most often cited in

the literature.12,13,22,23 For problems with more than two classes, one could define threshold values so as to divide the scalar range of z into the number of classes

    to be represented:

c = v : t_v^{low} < z < t_v^{high}    (2)

where threshold t_v^{low} > t_{v-1}^{high}. Equation (2) represents a scalar mapping. Using this

    mapping, the CMAC can be used as a classifier if, during the training phase, weights

    are adjusted to make the output z approach a suitable target value. For example,

    the target for a given class could be a value equidistant from the thresholds corre-

    sponding to that class.
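As a sketch of the scalar mapping of Eq. (2), the thresholds below divide an assumed output range evenly among the classes (contiguous thresholds, for simplicity), and the training target for each class is the midpoint of its interval; the range and class count are illustrative.

    M = 3                         # number of classes (illustrative)
    Z_MIN, Z_MAX = 0.0, 1.0       # assumed range of the output z

    width = (Z_MAX - Z_MIN) / M
    thresholds = [Z_MIN + v * width for v in range(M + 1)]    # class v occupies (t[v], t[v+1]]
    targets = [Z_MIN + (v + 0.5) * width for v in range(M)]   # midpoints, equidistant from the thresholds

    def scalar_class(z):
        """Assign the class v whose interval contains z (Eq. (2))."""
        for v in range(M):
            if thresholds[v] < z <= thresholds[v + 1]:
                return v
        return 0 if z <= Z_MIN else M - 1    # clamp outputs that fall outside the range

    print(scalar_class(0.55), targets)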

    2.2. Albus training algorithm

    The Albus CMAC is trained by evaluating the error as the difference between

    desired output zd and actual output z, and updating the active weights at each

    time step t:

w_i(t+1) = w_i(t) + \beta \, (z_d - z) \, \frac{k(d_i)}{\sum_{j=1}^{q} k(d_j)}    (3)

where \beta is the gain term.
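A self-contained sketch of one Albus update per Eq. (3); the step kernel, tiling parameters and fixed gain are illustrative assumptions, and the full algorithm repeats this update over the whole training set for several epochs.

    from collections import defaultdict

    Q, RES = 4, 0.25                 # tiling layers and tile width (illustrative)
    BETA = 1.0                       # gain term; the paper reduces it during training
    weights = defaultdict(float)     # sparse memory table

    def active_cells(u):
        """Return (cell key, kernel value) for each tiling layer; step kernel assumed."""
        cells = []
        for layer in range(Q):
            offset = layer * RES / Q
            coords = tuple(int((x + offset) // RES) for x in u)
            cells.append(((layer, coords), 1.0))   # step kernel: k = 1 inside the tile
        return cells

    def output(u):
        cells = active_cells(u)
        return sum(k * weights[key] for key, k in cells) / sum(k for _, k in cells)

    def albus_update(u, z_desired):
        """One Albus update (Eq. (3)): distribute the output error over the active cells."""
        cells = active_cells(u)
        k_sum = sum(k for _, k in cells)
        error = z_desired - output(u)
        for key, k in cells:
            weights[key] += BETA * error * k / k_sum

    # one pass over a toy training set; the real algorithm iterates until convergence
    for u, z_d in [([0.1, 0.2], 1.0), ([0.8, 0.9], -1.0)]:
        albus_update(u, z_d)
    print(output([0.1, 0.2]))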


    2.3. Kernel addition training algorithm

The scalar mapping above is not ideal, as it represents a nominal variable using a continuous scale, and there is no information about the degree of membership of a

    class. Consider an alternative output mapping, using a CMAC for each class:

c = v : z_v = \max(z_1, z_2, \ldots, z_m)    (4)

    where m is the number of classes. Equation (4) represents a vector mapping. This

    may be used to assess the decision of the classifier to assign any particular input to a

    class. For example, it is possible to discover if two classes have high activation, or if

    one class is the clear winner. A desirable property is that the output activations are

proportional to the probability of the class, given the input, so that z_v represents a relative probability of selecting class c. Then, it is possible to take account of

a priori probability using Bayes' law:

P(c_i \mid x) = \frac{P(c_i) \, P(x \mid c_i)}{\sum_i P(c_i) \, P(x \mid c_i)}    (5)

    where P represents probability.3 The frequency of samples occurring in each class

may be used to estimate P(c_i). The goal of training then is to provide an output z_i that can be used to estimate P(x|c_i). There is no need to calculate the denominator,

    as assignment to the highest probability class requires only comparison.

    The new training algorithm, the KATA,25 uses a vector class mapping. As each

    training vector is presented, a kernel function value for each activated tile is added to

    the value of the corresponding memory cell. Assuming n training points distributed

    uniformly over a tile, the expected value of the corresponding cell after training will

be n·k_e, where k_e is the expected value of the kernel function. If the kernel function

    is the step function, the value of each memory cell after training is a count of

    the number of times the corresponding tile was accessed during training. If the

    kernel function is not the step function, then training amounts to estimation of a

histogram, using as a weight some function of the distance of the input from the centre of the histogram bin. From the well-known properties of histograms, one

    concludes that:

    The value of any tile after training is proportional to the probability of inputs

    activating that tile.

    A histogram improves its estimate of the underlying distribution as the number

    of training samples increases, so the algorithm will converge.

    It is only necessary to present the training data once.

    There is no value in repeated presentation of the same training data.

    After training, the output z for each class will be proportional to the numerator

of Eq. (5). This may be observed by considering a classification problem where, if the number of training samples in one class is doubled, the


    output for this class is also doubled. Assume that the number of samples in a class

is a good estimator of P(c_i). Then the CMAC output after training is proportional

to P(c_i) P(x|c_i).

After training, each CMAC forms a piecewise model of the probability density

    function for the corresponding class. There is no need to normalize the output as

    in Eq. (1), so the output is given by:

z = \sum_{i=1}^{q} k(dist_i) \, w[addr_i]    (6)

    The KATA CMAC is trained using the value of the kernel:

w_i(t+1) = w_i(t) + k(dist_i)    (7)

    In contrast to the Albus training algorithm, the KATA is not an iterative algo-

    rithm. The weights are updated during a single presentation of the training data

    at the inputs. From this, it follows that the KATA is not sensitive to the order in

    which input samples are presented. Also, the KATA is robust to outliers, as outliers

    occur with low frequency, and so will have minimal effect on the CMAC output.
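The following sketch puts the pieces together: one sparse weight table per class, a single training pass that adds the kernel value to each active cell (Eq. (7)), and classification by the largest per-class output (Eqs. (4) and (6)). Tiling parameters and the step kernel are illustrative assumptions.

    from collections import defaultdict

    Q, RES = 4, 0.25                       # tiling layers and tile width (illustrative)

    def active_cells(u):
        """One active tile per tiling layer, with its kernel value (step kernel assumed)."""
        cells = []
        for layer in range(Q):
            offset = layer * RES / Q
            coords = tuple(int((x + offset) // RES) for x in u)
            cells.append(((layer, coords), 1.0))
        return cells

    class KataCmacClassifier:
        def __init__(self, n_classes):
            # one sparse weight table per class
            self.tables = [defaultdict(float) for _ in range(n_classes)]

        def train(self, samples):
            """Single pass over (u, c) pairs: add the kernel value to each active cell (Eq. (7))."""
            for u, c in samples:
                for key, k in active_cells(u):
                    self.tables[c][key] += k

        def outputs(self, u):
            """Per-class outputs z_v (Eq. (6)); each is proportional to P(c_v) P(u | c_v)."""
            return [sum(k * table[key] for key, k in active_cells(u))
                    for table in self.tables]

        def predict(self, u):
            """Vector mapping (Eq. (4)): assign the class with the largest output."""
            z = self.outputs(u)
            return max(range(len(z)), key=z.__getitem__)

    clf = KataCmacClassifier(n_classes=2)
    clf.train([([0.1, 0.2], 0), ([0.15, 0.25], 0), ([0.8, 0.9], 1)])
    print(clf.predict([0.12, 0.22]), clf.outputs([0.12, 0.22]))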

    3. Experiments and Results

    Comparing Eq. (3) with Eq. (7), it can be seen that the KATA can be completed in

    less time than one iteration of the Albus algorithm. The speed advantage will not

    be as great as suggested simply by comparing these equations due to the different

    software overheads. However, one would expect that the KATA would be faster than

    the Albus training algorithm. This conclusion was tested using computer models

    of the two algorithms for comparison purposes. The experiments were designed

    to demonstrate the linear relationship between number of training samples and

    training time.

    3.1. Artificial test problem

    The two CMAC learning algorithms were tested using the parity problem. This

    problem was chosen because of its low spatial frequency, ensuring that there will be

    enough samples to discriminate classes in tests with a high number of dimensions

    or a high number of classes. In this problem, the input space is partitioned into m

    regions in each dimension, where m is the number of classes and d the number of

dimensions. Given an input vector x = {x_1, ..., x_d}, 0 < x_i < r, the class label

is given by:

c = \left( \sum_{i=1}^{d} \lfloor m x_i / r \rfloor \right) \bmod m    (8)
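A sketch of generating such a data set, assuming the label is the sum of the per-dimension region indices taken modulo m as in Eq. (8); the range r = 1 and the use of NumPy are illustrative.

    import numpy as np

    def parity_dataset(n_samples, d, m, r=1.0, seed=0):
        """Generate the parity problem: uniform inputs in (0, r)^d, class label per Eq. (8)."""
        rng = np.random.default_rng(seed)
        x = rng.uniform(0.0, r, size=(n_samples, d))
        # region index of each coordinate (0 .. m-1), summed over dimensions modulo m
        c = np.floor(m * x / r).astype(int).sum(axis=1) % m
        return x, c

    x, c = parity_dataset(n_samples=1_000_000, d=3, m=2)
    print(x.shape, np.bincount(c))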


Fig. 3. The parity problem for a three-dimensional input space. White represents class 0, black represents class 1.

    Data sets were generated using randomly generated x values, and assigning a

    class label to each record according to Eq. (8). Seven data sets were generated,

    containing from 2 to 5 dimensions and from 2 to 5 classes. Each database consisted

    of 1 million samples, with input variables drawn from a uniform distribution. The

    classifiers were tested using different numbers of samples.

    3.2. Natural test problem

    The two CMAC algorithms were tested using a natural data set, derived from the

    1998 DARPA Intrusion Detection Evaluation Program.26 The dataset was originally

    collected to establish the efficacy of intrusion detection and includes a variety of

    simulated intrusions of a military computer network. A version of this was used for

    the KDDcup99 contest. There are 24 classes representing different types of attack

    and 41 measurements, or features used as inputs to the classifier. Some of these are

    discrete and some are numeric. The datasets used in these tests contained 494,020

    records.

The dataset was adapted for testing the time complexity of the CMAC algorithms as follows. Features with a small number of integer values were removed,

    as the CMAC uses continuous inputs only. Also, features that are zero most of

    the time were removed. Thus, 12 features are left. The algorithms were tested on

different numbers of records by extracting, at random, subsets containing different

numbers of samples.
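A sketch of this preprocessing, assuming the KDDcup99 CSV with 41 feature columns followed by a label column; the cut-offs used to drop low-cardinality integer features and mostly-zero features are illustrative assumptions, since the paper does not state them.

    import pandas as pd

    def prepare_kdd(path, n_records, seed=0):
        """Keep continuous, informative features and draw a random subset of records."""
        df = pd.read_csv(path, header=None)
        label = df.iloc[:, -1]              # attack type (24 classes in the full set)
        features = df.iloc[:, :-1]

        keep = []
        for col in features.columns:
            col_data = features[col]
            if not pd.api.types.is_numeric_dtype(col_data):
                continue                     # drop symbolic features
            if col_data.nunique() <= 10 and (col_data == col_data.astype(int)).all():
                continue                     # drop features with a small number of integer values
            if (col_data == 0).mean() > 0.9:
                continue                     # drop features that are zero most of the time
            keep.append(col)

        sample = df.sample(n=n_records, random_state=seed).index
        return features.loc[sample, keep], label.loc[sample]

    # X, y = prepare_kdd("kddcup.data_10_percent", n_records=100_000)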

    3.3. Test methodology

Both versions of the CMAC used the same parameters. Input space was uniformly quantized in all dimensions. Tile spacing was based on the work of Parks and

Militzer.27 A hashing function with sufficient memory was used to avoid hashing collisions in both implementations, and


    thereby enable meaningful comparisons of running time. The Albus CMAC used a

    scalar output mapping, while the KATA CMAC used a vector output mapping.

The gain term \beta for the Albus training algorithm was set to 1.0 at the start of training, and reduced during training, as this guarantees quick convergence.28,29

This was implemented by setting \beta to the value of the normalized training error.

The number of epochs used must be sufficient to allow convergence, but not so

many as to cause overfitting. After each epoch, the accuracy was compared to

    that from the previous epoch. If the accuracy had increased by less than 0.1%, then

    training would be terminated.
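A sketch of this schedule and stopping rule; train_epoch and accuracy are placeholders for one pass of Eq. (3) updates and for evaluation of percentage accuracy, and the toy trajectory at the end is purely illustrative.

    def fit_albus(train_epoch, accuracy, max_epochs=100):
        """Albus training epochs with an adaptive gain and an early-stopping rule.

        train_epoch(gain) performs one pass of updates and returns the normalized
        training error; accuracy() returns percent correct.
        """
        gain = 1.0                          # gain set to 1.0 at the start of training
        previous = current = accuracy()
        for _ in range(max_epochs):
            error = train_epoch(gain)
            gain = error                    # gain follows the normalized training error
            current = accuracy()
            if current - previous < 0.1:    # improvement below 0.1%: stop training
                break
            previous = current
        return current

    # toy demonstration with a fabricated error/accuracy trajectory
    errors = iter([0.4, 0.2, 0.1, 0.05])
    accs = iter([50.0, 70.0, 80.0, 80.05, 80.06])
    print(fit_albus(lambda g: next(errors), lambda: next(accs)))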

    The performance of the two algorithms was tested using three-fold cross valida-

    tion, so that accuracy was always tested only on unseen data. The data sets used

    were each divided into three parts at random. Training was performed using two

    parts of the data, and the trained model was tested on the remaining one part.

    This was done three times using a different part for testing. In this manner, the

    model was tested on all data, and reported a number of correctly classified samples,

    which was divided by the size of the data set to obtain percentage accuracy. The

    choice of the fraction one third is a compromise between using all data to train,

which may result in overfitting the model, and using less data to be computation-

    ally efficient.30 For each test, the time taken to train and the resulting accuracy

were measured.
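A sketch of the three-fold protocol, assuming a classifier object with train and predict methods such as the KATA sketch above; the random split and percentage calculation follow the description in the text.

    import random

    def three_fold_accuracy(samples, make_classifier, seed=0):
        """Three-fold cross-validation: train on two parts, test on the remaining part."""
        shuffled = samples[:]
        random.Random(seed).shuffle(shuffled)
        folds = [shuffled[i::3] for i in range(3)]

        correct = 0
        for i in range(3):
            test = folds[i]
            train = [s for j in range(3) if j != i for s in folds[j]]
            clf = make_classifier()
            clf.train(train)
            correct += sum(1 for u, c in test if clf.predict(u) == c)
        return 100.0 * correct / len(samples)

    # e.g. combining the earlier sketches:
    # acc = three_fold_accuracy(list(zip(x.tolist(), c.tolist())), lambda: KataCmacClassifier(2))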

    3.4. Performance comparison

    In order to put these results in context, some other classifier algorithms were com-

    pared to CMAC. For this purpose, the Weka toolbox was used.31 As this toolbox

consists of programs written in Java using a common framework, it is possible

to make direct comparisons of running time.

    CMAC algorithm was also coded in Java using the same libraries in order to pro-

vide the most realistic comparison. Initially, 12 algorithms were considered from the wide range provided by Weka. Some of these were discarded during tests because

    their long running time did not provide a fair comparison with CMAC. Of the

    available algorithms, the three fastest were selected: functions.RBFNetwork (place-

    ment of Gaussian kernels using clustering), functions.SMO(the Sequential Minimal

    Optimization version of Support Vector Classifiers), and trees.J48 (an implemen-

    tation of the C4.5 decision tree algorithm). These three, as well as KATA CMAC,

    were tested on the same data sets from the Parity problem described earlier, using

    10-fold cross validation.
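The published comparison used the Weka implementations named above; purely to illustrate the protocol, the sketch below times two broadly analogous scikit-learn classifiers (SVC as an SMO-style support vector classifier, and a CART decision tree standing in for C4.5/J48) under 10-fold cross-validation on a toy parity data set. These are substitutes, not the classes used in the paper.

    import numpy as np
    from sklearn.model_selection import cross_validate
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    def time_classifiers(X, y):
        """Report fit time, score time and accuracy under 10-fold cross-validation."""
        for name, clf in [("SVC (SMO-style)", SVC()),
                          ("CART tree (J48 analogue)", DecisionTreeClassifier())]:
            res = cross_validate(clf, X, y, cv=10, scoring="accuracy")
            print(name,
                  "fit %.1fs" % res["fit_time"].sum(),
                  "score %.1fs" % res["score_time"].sum(),
                  "acc %.3f" % res["test_score"].mean())

    # small demonstration on a toy parity data set (2 dimensions, 5 classes)
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(2000, 2))
    y = np.floor(5 * X).astype(int).sum(axis=1) % 5
    time_classifiers(X, y)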

    3.5. Results







Fig. 6. Comparisons for the Parity problem using 2 dimensions and 5 classes. (a) showing training time in seconds against number of samples (in thousands), (b) showing accuracy against number of samples (in thousands).

    the CMAC algorithm. Figure 6 shows the results for the Intrusion Detection prob-

    lem. This supports the belief that CMAC classifiers have a linear time complexity

    for training. In all the tests, the KATA algorithm was about 2.5 to 3 times as fast

    as Albus. Note that in all these tests, the Albus training algorithm used a variable

    number of training epochs, which explains in part the occasional outliers. So the

    relative speed advantage of the KATA depends on the number of iterations of the

    Albus algorithm. The accuracy obtained by training with the KATA is consistently

    superior to that obtained using the Albus technique. There are two possible expla-

    nations for this. First, when the problem becomes more difficult, using more classes

    or dimensions, the performance of the classifier is bound to deteriorate, because the

number of samples available for each homogeneous block of the input space decreases.

    Since the Albus technique uses an error minimization, this is an inherently biased

    model, whereas the KATA uses an unbiased model of input space. Therefore, the

accuracy of the classifier trained using the KATA degrades more slowly. Second, the Albus method suffers from the difficulty of correctly setting the gain parameter,

    which is not necessary for the KATA.

    Figure 7 shows results for the comparison between four classifier methods. Here,

    all the methods chosen show evidence of linear time complexity. The algorithm

    taking the longest time to build the model was J48, taking up to 10,000 seconds to

    train on datasets near one million samples. The next slowest was SMO, taking up to

    4000 seconds to train. CMAC was the algorithm with the fastest training time, and

    RBF was very close. The slowest classifier for testing was RBF, taking 70 seconds

to classify. The next slowest was CMAC, taking up to 20 seconds to classify. The other two methods, SMO and J48, were much quicker, classifying unknown cases

in less than 10 seconds.



Fig. 7. Comparisons for the Parity problem using 2 dimensions and 5 classes. (a) training time in seconds against number of samples (in thousands), (b) testing time in seconds against number of samples (in thousands), (c) accuracy in percent correct against number of samples (in thousands), (d) legend.

    SMO, and the close superposition of the boundaries in J48. The accuracy of the

RBF classifier is similar to that of CMAC. This is expected, since the similarity of CMAC to RBF is well known.

    It is clear that CMAC compares well with the other methods examined. It should

    be noted that only the fastest classifier methods were examined, so it was possible

    that CMAC would outperform other classifier methods on speed and accuracy.

    4. Conclusions

    There are three main results from this work. First, the training of CMAC-based

classifiers has linear time complexity. This is a highly desirable property of machine learning techniques, as it makes the processing of large databases more computa-

tionally feasible. Second, the new training algorithm allows the CMAC to be trained with only a single pass of the data. Third, the vector output mapping provides


    accountability for class assignment decisions and allows a priori probability to be

    accounted for. This has potential in applications that require estimation of the risk

of incorrect classification.

These three main results are supported by empirical evidence presented here.

    Other results may be inferred from the nature of the algorithm, namely, that the

    KATA is not sensitive to the order of training data, and is robust to outliers.

    Comparative trials suggest that different classifiers have advantages in different

    areas, but CMAC with KATA has the characteristic of fast training. This new

    training algorithm has great potential for application in data mining and automated

    knowledge discovery.

    Acknowledgments

    The author wishes to thank the New South Wales Centre for Parallel Computing

    (NSWCPC) for the use of their SGI Power Challenge machine upon which the

    calculations for this paper were performed. Part of this work was supported by a

    Faculty Seed Grant from Charles Sturt University, and part was supported by a

Rector's Start-up Grant from the University of New South Wales.

    References

1. T. G. Dietterich and G. Bakiri, Solving multiclass learning problems via error-correcting output codes, J. Artif. Intell. Res. 2 (1995) 263–286.

2. J. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence (MIT Press, 1992).

3. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis (John Wiley and Sons, New York, 1973).

4. J. Han and M. Kamber, Data Mining: Concepts and Techniques (Morgan Kaufman, 2001).

5. A. Roy, S. Govil and R. Miranda, A neural-network learning theory and a polynomial time RBF algorithm, IEEE Trans. Neural Networks 8(6) (1997) 1301–1313.

6. J. He and X. Yao, Drift analysis and average time complexity of evolutionary algorithms, Artif. Intell. 127(1) (2001) 57–85.

7. J. S. Albus, A new approach to manipulator control: The Cerebellar Model Articulation Controller (CMAC), J. Dynam. Syst. Measurement Contr. 97 (1975) 220–233.

8. J. S. Albus, Mechanisms of planning and problem solving in the brain, Math. Biosci. 45 (1979) 247–293.

9. J. C. Santamaria, R. S. Sutton and A. Ram, Experiments with reinforcement learning in problems with continuous state and action spaces, Technical Report UM-CS-1996-088, Department of Computer Science, University of Massachusetts, Amherst, MA (1996).

10. M. Wiering, R. Salustowicz and J. Schmidhuber, Reinforcement learning soccer teams


12. Z. J. Geng and W. Shen, Fingerprint classification using fuzzy cerebellar model arithmetic computer neural networks, J. Electron. Imag. 6(3) (1997) 311–318.

13. H. Fashandi and M. Moin, Face detection using CMAC neural network, in Proc. 7th Int. Conf. Artif. Intell. Soft Comput. ICAISC, eds. L. Rutkowski, J. Siekmann, R. Tadeusiewicz and L. Zadeh, Lecture Notes in Computer Science, Vol. 3070 (Springer, 2004) 724–729.

14. W. Xu, S. Xia and H. Xie, Application of CMAC-based networks on medical image classification, in Proc. Int. Symp. Neural Networks, eds. F. Yin, J. Wang and C. Guo, Lecture Notes in Computer Science, Vol. 3173 (Springer, 2004) 953–958.

15. D. Cornforth, Classifiers for machine intelligence, PhD thesis, Nottingham University, UK (1994).

16. T. H. Cormen, C. E. Leiserson and R. L. Rivest, Introduction to Algorithms (McGraw-Hill, 1986).

17. H. Lee, C. Chen and Y. Lu, A self-organizing HCMAC neural-network classifier, IEEE Trans. Neural Networks 14(1) (2003) 15–27.

18. C. Chen, C. Hong and Y. Lu, A pruning structure of self-organising HCMAC neural network classifier, in Proc. 2004 IEEE Int. Joint Conf. Neural Networks 2 (2004) 861–866.

19. P. C. E. An, W. T. Miller and P. C. Parks, Design improvements in associative memories for cerebellar model articulation controllers, Proc. ICANN (1991), pp. 1207–1210.

20. S. H. Lane, D. A. Handelman and J. J. Gelfand, Theory and development of higher-order CMAC neural networks, IEEE Cont. Syst. (1992) 23–30.

21. F. J. Gonzalez-Serrano, A. R. Figueiras-Vidal and A. Artes-Rodriguez, Generalizing CMAC architecture and training, IEEE Trans. Neural Networks 9(6) (1998) 1509–1514.

22. H. Xu, C. Kwan, L. Haynes and J. Pryor, Real-time adaptive on-line traffic incident detection, Proc. IEEE Int. Symp. Intell. Contr. (1996), pp. 200–205.

23. J. Geng and T. Lee, Freeway traffic incident detection using fuzzy CMAC neural networks, Proc. IEEE World Congress Comput. Intell. 2 (1998) 1164–1169.

24. Y. Wong, CMAC learning is governed by a single parameter, in Proc. IEEE Int. Conf. Neural Networks, San Francisco (1993), pp. 1439–1443.

25. D. Cornforth and D. Newth, The kernel addition training algorithm: Faster training for CMAC based neural networks, in Proc. Conf. Artif. Neural Networks Expert Syst. (University of Otago, 2001), pp. 34–39.

26. S. Hettich and S. D. Bay, The UCI KDD Archive, University of California, Department of Information and Computer Science, Irvine, CA (1999), http://kdd.ics.uci.edu.

27. P. C. Parks and J. Militzer, Improved allocation of weights for associative memory storage in learning control systems, in Proc. IFAC Design Meth. Contr. Syst., Zurich, Switzerland (1991), pp. 507–512.

28. C. Lin and C. Chiang, Learning convergence of CMAC technique, IEEE Trans. Neural Networks 8(6) (1997) 1282–1292.

29. S. Yao and B. Zhang, The learning convergence of CMAC in cyclic learning, Proc. Int. Joint Conf. Neural Networks 3 (1993) 2583–2586.

    30. S. Weiss and C. A. Kulikowski (eds.), Computer Systems That Learn: Classification

and Prediction Methods From Statistics, Neural Nets, Machine Learning, and Expert Systems (Morgan Kaufman, San Mateo, CA, 1991).

31. I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques.
