
1374 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 6, NOVEMBER 2002

Statistical Analysis of the Parameters of a Neuro-Genetic Algorithm

Pedro A. Castillo-Valdivieso, Juan J. Merelo, Alberto Prieto, Member, IEEE, Ignacio Rojas, and Gustavo Romero

Abstract—Interest in hybrid methods that combine artificial neural networks (ANNs) and evolutionary algorithms (EAs) has grown in the last few years, due to their robustness and ability to design networks by setting initial weight values, and by searching the architecture, the learning rule, and parameters. However, papers describing the way genetic operators are tested to determine their effectiveness are scarce; moreover, few researchers publish the most suitable values of these operator parameters to solve a given problem. This paper presents an exhaustive analysis of the G-Prop method, and the different parameters the method requires (population size, selection rate, initial weight range, number of training epochs, etc.) are determined. The paper also discusses the influence of the application of genetic operators on the precision (classification ability or error) and network size in classification problems. When making a detailed statistical analysis of the influence of each parameter, the designer should pay most attention to the parameter presenting values that are statistically most significant. The significance and relative importance of the parameters with respect to the results obtained, as well as suitable values for each, were obtained using ANalysis Of the VAriance (ANOVA). Experiments show the significance of parameters concerning the neural network and learning in the hybrid methods. Combining evolutionary algorithms and neural network learning methods can lead to better results than using those methods alone. Moreover, making the network initial weights evolve is an important factor in the process. The parameters found using this method were used to compare the G-Prop method both to itself with other parameter settings, and to other published methods.

Index Terms—ANalysis Of the VAriance (ANOVA), artificial neural networks (ANNs), evolutionary algorithms (EAs), hybrid methods, optimization, statistical analysis.

I. INTRODUCTION

INTEREST in hybrid methods that combine artificial neural networks (ANNs) and evolutionary algorithms (EAs) has grown in the last few years, due to their robustness and ability to design networks by setting initial weight values, searching the architecture, the learning rule, and parameters.

Nevertheless, papers describing in detail the way genetic operators are tested to determine their effectiveness (searching for the best network) are scarce; moreover, few researchers publish the most suitable values of operator parameters to solve a given problem.

Manuscript received October 18, 2001; revised May 8, 2002. This paper was supported in part by CICYT under Projects TIC99-0550, IST-1999-12679, and INTAS 97-30950.

The authors are with the Department of Architecture and Computer Technology, University of Granada, Campus de Fuentenueva, E-18071 Granada, Spain (e-mail: [email protected]).

Digital Object Identifier 10.1109/TNN.2002.804281

This paper includes an exhaustive analysis of the G-Prop method [1]–[3], designed to determine the learning rate, the initial weights, and a suitable hidden-layer size of multilayer perceptrons (MLPs), setting the parameters that the method requires. The genetic operators designed are evaluated using classification problems.

In G-Prop, we make use of the capabilities of two classes of algorithms: the ability of the EA to find a solution close to the global optimum, and the ability of the quickpropagation (QP) algorithm to tune a solution and to reach the nearest local minimum by means of local search from the solution found by the EA. Instead of using a preestablished topology, the population is initialized with different hidden layer sizes, with specific operators designed to modify them during evolution. We used mutation, multipoint crossover, addition, elimination, and QP training as the variation operators. Thus, the EA searches and optimizes the architecture (number of hidden units), the initial weight settings for that architecture, and the learning rate for that net.

The main advantage of G-Prop is that no MLP parameters have to be set by hand (except the number of training epochs to evaluate the population individuals, since, depending on the problem, a higher or lower value will be necessary; this aspect is studied in the present paper). Obviously, the EA constants (some of them concerning the initial population, and others concerning the genetic operators) need to be set. This algorithm has been presented in previous articles, and applied to different kinds of problems [1]–[3]. This paper is intended to comprise a more exhaustive study and to enable general conclusions to be drawn.

Genetic algorithm users adjust the main design parameters of an EA (crossover probability, mutation probability, population size, number of generations, and selection rate) by hand [7], [8]. The decision as to which values are best is usually made in terms of the most common values or experimental formulas given in the bibliography, or by trial and error [9], [10]. Nevertheless, it is very important to know which parameter values involved in the design of an EA have the greatest influence on its behavior and performance. When making a detailed statistical analysis of the influence of each parameter, the designer should pay most attention to the parameter providing the values that are statistically most significant.

The goal of this paper, thus, is to gain a better insight into the most relevant aspects of EA design, to achieve a better implementation of the elemental operations.

The results obtained could be generalized to other neurogenetic algorithms, and even to other kinds of hybrid algorithms.

In order to determine the most important parameters (in the sense of their influence on the results), and to establish the most suitable values for such parameters (thus obtaining an optimal


operation), the ANalysis Of the VAriance (ANOVA) [11], [12] statistical method was used. This statistical tool, based on the analysis of the mean variance, is widely used.

The theory and methodology of ANOVA was mainly developed by Fisher during the 1920s [11], [12]. ANOVA examines the effects of one, two, or more quantitative or qualitative variables (called factors) on one quantitative response. ANOVA is useful in a range of disciplines when it is suspected that one or more factors might affect a response. ANOVA is essentially a method of analyzing the variance to which a response is subject, dividing it into the various components corresponding to the sources of variation, which can be identified.

The ANOVA method allows us to determine whether a change in the responses is due to a change in a factor or due to a random effect. Thus, it is possible to determine the variables with greatest effect on the method that is being evaluated.

The basic assumptions made to apply this statistical tool satisfactorily [4], [5] are that the observations obtained should

• be mutually independent;
• be distributed according to a normal distribution;
• all have the same variance σ²;
• have averages that can be expressed as a linear combination of certain unknown parameters.

These ANOVA assumptions must be verified by studying the residuals [6] in order to detect discrepancies between the experimental data and the output of the model. The assumptions of the model are fulfilled when the residuals vary in a random way. Therefore, the statistical analysis in Section IV-B was performed after a preliminary inspection of the residuals which, although not described in the present paper, shows the correctness of the model applied.

As an example, assume a simple case in which two factors affect the outcome of the experiment. We denote by y_{ij} (i = 1, ..., a; j = 1, ..., b) the value observed when the first factor is at the ith level and the second at the jth level. It is assumed that the two factors do not act independently and, therefore, that there exists an interaction between them. In this case, the observations fit the following equation:

y_{ij} = μ + α_i + β_j + (αβ)_{ij} + ε_{ij}    (1)

where μ is the fixed effect that is common to all the populations, α_i is the effect associated with the ith level of the first factor, and β_j is the effect associated with the jth level of the second factor. The term (αβ)_{ij} denotes the joint effect of the presence of level i of the first factor and level j of the second one; this term, therefore, is named the interaction term. The term ε_{ij} is the influence on the result of everything that could not be assigned, or of random factors.

With ANOVA, we test a null hypothesis that all of the population means are equal against the alternative hypothesis that there is at least one mean that is not equal to the others. We find the sample mean and variance for each level (value) of the main factor. Using these values, we obtain two different estimates of the population variance. The first one is obtained by finding the sample variance of the sample means from the overall mean. This variance is referred to as the variance among the means. The second estimate of the population variance is found by using a weighted average of the sample variances. This variance is called the variance within the means.

The estimations ANOVA offers are based on the value of the F statistic

F = (SS_B / DF_B) / (SS_W / DF_W)    (2)

where SS_B is the sum of the squares of the observations according to the levels (values) of all the factors and SS_W is the sum of the squares according to each level, both of which are divided by their corresponding number of degrees of freedom (DF).

Once calculated, F is compared with the F-Snedecor distribution [13], [14] with the appropriate number of degrees of freedom, to obtain a significance value (Sig. Level). If this level is lower than 0.05, then the influence of the factor is statistically significant at the confidence level of 95%.
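As an illustration of how the F value and its significance level are obtained, the following Python sketch computes a one-way ANOVA for a single factor, mirroring (2). It is a minimal sketch rather than the authors' code: the function name, factor levels, and error values are hypothetical, and SciPy is assumed to be available for the F-Snedecor distribution.

# Minimal one-way ANOVA sketch (not the authors' code); `groups` maps each
# factor level to the responses (e.g., test errors) observed at that level.
import numpy as np
from scipy.stats import f as f_dist

def one_way_anova(groups):
    data = [np.asarray(g, dtype=float) for g in groups.values()]
    all_obs = np.concatenate(data)
    grand_mean = all_obs.mean()
    # Between-level ("among the means") sum of squares and degrees of freedom.
    ss_b = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in data)
    df_b = len(data) - 1
    # Within-level ("within the means") sum of squares and degrees of freedom.
    ss_w = sum(((g - g.mean()) ** 2).sum() for g in data)
    df_w = len(all_obs) - len(data)
    f_value = (ss_b / df_b) / (ss_w / df_w)
    sig_level = f_dist.sf(f_value, df_b, df_w)  # tail probability of the F-Snedecor distribution
    return f_value, sig_level

# Hypothetical errors observed at three levels of one operator priority.
levels = {0.0: [35.1, 33.8, 36.0], 1.0: [31.2, 30.5, 32.0], 2.0: [29.9, 31.1, 30.4]}
F_val, sig = one_way_anova(levels)
print(F_val, sig, sig < 0.05)  # the factor is significant at 95% confidence if sig < 0.05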

Besides the ANOVA method, the statistical analysis tool ANalysis Of Mean (ANOM) is used. This technique uses the average values for each parameter level (value) and the main effect (on the responses) plots, to decide the most suitable values for each parameter.

In those cases where it is only necessary to evaluate the difference of means between two groups, statistical t-Student tests are used, as these can be used even on small samples. This method calculates a value that represents the error probability if the null hypothesis is rejected, that is, the probability of observing such a difference supposing that there is no difference between the levels of the observations in the population.
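For instance, under the assumption that SciPy is available, such a two-group comparison could be sketched as follows (the two sets of errors are hypothetical):

# Sketch of a t-Student test between two groups of results (hypothetical data).
from scipy.stats import ttest_ind

errors_config_a = [30.1, 31.5, 29.8, 30.9, 31.2]
errors_config_b = [33.0, 32.4, 34.1, 33.7, 32.9]
t_stat, p_value = ttest_ind(errors_config_a, errors_config_b)
print(t_stat, p_value)  # a small p_value indicates a significant difference of means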

The rest of this paper is structured as follows: Section II presents a comprehensive review of the approaches found in the bibliography to describe different application parameters and determine the most suitable values for these parameters. Section III describes the G-Prop algorithm and the proposed genetic operators. Section IV contains an exhaustive analysis of the method, showing how the selection percentage is determined, and evaluates the proposed genetic operators, using classification problems. We study the evolution of the precision (classification ability or error) and network size, and compare the results obtained with those available in the bibliography. Section V applies the method to a real-world classification problem. Finally, a brief conclusion is presented in Section VI.

II. STATE-OF-THE-ART

If an EA is to be used, components such as genetic operators, selection and replacement mechanisms, and the initial population must first be chosen. The parameters of some of these elements (mutation probability, selection and replacement rate, or the population size) determine the way they operate and influence the results obtained. Obtaining suitable values for the parameters is an expensive, time-consuming, and laborious task.

One of the most common ways of setting these parameters is by hand, after intensive experimentation with different values [15]. An alternative is to use an EA that acts as a "meta-evolutionary algorithm" to optimize the EA parameters. The idea is to run a higher level EA that searches for a good set of parameters for a lower level EA [10], [16]. The cited authors search for an optimal and general set of parameters to solve a wide range of optimization problems.


However, as some authors remark, solving specific problems requires specific parameter value sets [17]–[19] and, as Harik [20] claims, nobody knows the "optimal" parameter settings for an arbitrary real-world problem. So, establishing the optimal and general set of parameters is a difficult problem.

Other researchers have proposed determining a good set of EA parameters by analogy, making a theoretical analysis [21]–[28]. Establishing parameters by analogy means using suitable sets of parameters to solve similar problems. However, they do not explain how to measure similarity between problems. Neither is a clear protocol proposed for situations when the similarity between problems implies that most suitable sets of parameters are also similar [18], [19].

Finally, some authors have proposed practical approaches that eliminate the need for parameter search in GA [20]. In these works, a set of parameters is found, but instead of finding them by means of intense experimentation, the parameter settings are backed up by a theoretical work, meaning that these settings are robust.

III. EA USED IN THE G-PROP METHOD

The complete description of the method and the results obtained using classification problems have been presented elsewhere [1]–[3], and so this section only presents a brief description of the method and its components.

The algorithm is specified in the following pseudocode.

1. Generate the initial population with MLPs with random weight values in a specified range and random hidden layer sizes.
2. Repeat for n generations:
   (a) Evaluate the new MLPs (individuals): train them using the training set and obtain their fitness according to the number of correct classifications on the validation set and the network size (number of weights).
   (b) Select the s best individuals in the population, according to their fitness, to mate using the genetic operators to obtain the new individuals.
   (c) Replace the s worst individuals in the population by the new ones.
3. Use the best individual found to obtain the testing error using the test set.

A. Selection Algorithm

A steady-state [29] algorithm was used because it was empirically found to be faster at obtaining solutions than other selection algorithms, making a more exhaustive exploration of the search space [30].

For each generation, the s best individuals of the population, those whose fitness is highest, are chosen to mate using the genetic operators. The offspring replace the s worst individuals of the current generation.
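A minimal Python sketch of this steady-state generation step is shown below. It is not the authors' implementation: the individual attributes, the operator objects, and the evaluate function are hypothetical placeholders for the components described in this section.

import random

# Steady-state step (sketch): the s best individuals breed; their offspring
# replace the s worst individuals of the population.
def steady_state_generation(population, s, operators, evaluate):
    population.sort(key=lambda ind: ind.fitness, reverse=True)  # best first
    parents = population[:s]
    offspring = []
    for _ in range(s):
        # Pick an operator according to its normalized application probability.
        op = random.choices(operators, weights=[o.probability for o in operators])[0]
        child = op.apply(random.sample(parents, op.arity))  # e.g., arity 2 for crossover
        child.fitness = evaluate(child)  # train a clone with QP, score on the validation set
        offspring.append(child)
    population[-s:] = offspring  # replace the s worst
    return population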

B. EA Population

In G-Prop, an individual is a complete MLP with a hidden layer. An EA needs the individuals of the population to be encoded as chromosomes to be handled by the genetic operators of the EA. But G-Prop uses neither binary nor real encoding [31] (representation of the networks in a binary or real number string) nor indirect coding [32]; instead, the initial parameters of the network (initial weights and learning constants) are evolved using specific genetic operators, such as mutation, crossover, addition, and elimination of hidden neurons and the training (QP) operator, which act on the MLP data structure [1]–[3].

C. Genetic Operators

The genetic operators act directly on the ANN object (instead of performing neuron-level evolution [33]), but only initial weights and the learning constant are subject to evolution, not the weights obtained after training.

In order to calculate the fitness, a clone of the MLP is created, and, thus, the initial weights remain unchanged in the original MLP.

Weight changes made during the training phase of fitness evaluation are not saved back into the individual's "genetic code" (the data structure that represents the individual); only when the training operator is used are the changes saved back into the genetic code of the individual that remains in the population.

When a genetic operator changes an MLP, it considers each hidden neuron (and its input and output weights) as a "gene," so that, if two MLPs are crossed, complete hidden layer neurons are interchanged (and weights to and from it are treated as one unit), as proposed in [34]–[36].

Five operators were designed to evolve MLPs. Due to the way the genetic operators are applied, an operator application priority must be specified to the EA; with each generation, before applying the genetic operators, all operator priorities are normalized, dividing them by the accumulated total priority to obtain the operator application probability. Thus, if all operator priorities are equal to one, and five operators are used, then each operator probability will be 20%. In this way, the percentage of offspring generated using each genetic operator is specified to the method.
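A small illustrative sketch of this normalization step (not tied to the authors' implementation) is:

# Normalize operator priorities into application probabilities.
def priorities_to_probabilities(priorities):
    total = sum(priorities.values())
    return {name: p / total for name, p in priorities.items()}

# With all five priorities equal to one, each operator is applied with probability 0.2 (20%).
print(priorities_to_probabilities(
    {"mutation": 1.0, "crossover": 1.0, "addition": 1.0, "elimination": 1.0, "training": 1.0}))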

1) Mutation: This operator randomly changes the weights of certain neurons, depending on the application percentage.

This operator resembles that of Montana and Davis [37] and the algorithm presented by Kinnebrock in [38]. The latter proposed applying a mutation operator after each presentation of the training data set (after each training epoch). The present modification consists of changing the neural-network weights (the percentage of weights being given by the mutation probability) by adding a small random number uniformly distributed in a specified interval (the limits of which are established later).

Mutation also affects the learning rate, which is modified by adding a small random number that follows a uniform distribution in a specified interval (established in Section IV-B).
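As a sketch of the operator just described, assuming a hypothetical individual that stores its initial weights in a flat NumPy array and its learning rate as an attribute (not the authors' data structure):

import numpy as np

# Mutation sketch: perturb a fraction of the initial weights and the learning rate
# with uniform random numbers drawn from the ranges studied in Section IV-B.
def mutate(ind, mutation_prob, weight_range, lr_range, rng=None):
    rng = rng or np.random.default_rng()
    mask = rng.random(ind.weights.shape) < mutation_prob        # which weights to change
    ind.weights[mask] += rng.uniform(-weight_range, weight_range, size=int(mask.sum()))
    ind.learning_rate += rng.uniform(-lr_range, lr_range)       # also mutate the learning rate
    return ind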

2) Crossover: This operator performs two-point crossover between two MLPs to produce two new MLPs. The hidden layer neurons are a mixture of those of the two parents: some of the hidden neurons, together with the in and out connections from


each parent, make up one offspring, and the remaining hidden neurons comprise the other.

Finally, the learning rate is swapped between the two nets.

Some researchers [39], [40] propose avoiding the use of

crossover operators in ANN optimization problems, since the distributed way the network "stores information" makes it difficult to decide what is the "building block" that must be interchanged between the networks. It has been claimed that the way the operator works greatly changes the network, and that it acts as a "macromutation" operator. In Section IV-B, we test whether the operator acts by recombining MLP "building blocks" or, on the contrary, its effect is that of a macromutation. Its effectiveness within the G-Prop method is then verified.

If the method works in a similar way whatever the application rate, this indicates that this operator causes macromutations that make the method explore the search space more widely (introducing diversity into the EA population).
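A sketch of this neuron-level crossover, under the assumption that each parent exposes a list `hidden` of hidden-neuron genes (each carrying its incoming and outgoing weights) and a `learning_rate` attribute, could look as follows; the representation is illustrative, not the authors':

import copy
import random

# Two-point crossover at the hidden-neuron level: a slice of complete hidden
# neurons is exchanged between the parents, and the learning rates are swapped.
def neuron_crossover(parent_a, parent_b):
    a, b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    n = min(len(a.hidden), len(b.hidden))
    i, j = sorted(random.sample(range(n + 1), 2))     # two cut points
    a.hidden[i:j], b.hidden[i:j] = b.hidden[i:j], a.hidden[i:j]
    a.learning_rate, b.learning_rate = b.learning_rate, a.learning_rate
    return a, b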

3) Training: This operator is used to improve the individual MLP through a local search (applying QP), as suggested in [37] and [41], who proposed applying backpropagation (BP) as a variation operator that tunes an ANN.

QP was chosen as the gradient descent method for MLP training, as opposed to others (e.g., RPROP [42], [43] and the Delta-Bar-Delta rule [44]) due to its higher training speed and its ability to avoid local optima. With QP, only the learning rate must be optimized, although whichever training algorithm is used, the learning rate must be optimized in the same way.

When it is applied, the operator takes an MLP that is trained for a specified number of epochs, and returns it, with the trained weights, to the population. QP is an adaptive learning rate adjustment method. Therefore, when it is applied, the MLP learning rate changes and remains modified in the population.

This could be called a Lamarckian strategy [45], since the characteristics acquired during the individual's life stay in the "genetic code" so that the offspring can inherit its characteristics. Everything an individual learns during its life remains in the population. Thus, the offspring generated have the classification ability of their parents, in addition to that obtained during their own lives.

4) Adding Hidden Neurons: This operator and the following one (elimination) attempt to solve one of the main problems of BP and its variants: the difficulty of guessing the number of hidden layer neurons. By adding hidden neurons, it is not necessary to set the size of the EA search space.

This operator is intended to perform incremental design: it starts with a small structure and increments it, if necessary, by adding new hidden units. Now, however, there arises the dilemma of overfitting: small networks generalize well, but are slow at learning, whereas big networks learn fast (needing fewer training epochs to obtain similar precision), but generalize badly [46], [47]. The way to obtain good generalization ability is to use the smallest network that can learn the input data efficiently [48]–[50].

For this reason, this operator is combined with the following one. In addition, it is compensated with the component of the fitness function that penalizes the network size.

5) Removing Hidden Neurons: When applied, this operator removes a hidden layer neuron. This operator is intended to perform decremental ANN design: it prunes certain nodes and, if this is performed properly, obtains better generalization results and a smaller network [51], [52], [47]. This means, to a certain extent, that the networks are prevented from growing too much and reaching an excessive size.

The operator works in the following way: the unit to be removed is chosen taking into account the accumulated error on all the training patterns (according to the BP algorithm). This can be seen as a Lamarckian strategy [45] (like the one used in Section III-C3), since it makes use of information about the network to carry out directed elimination of a hidden unit after training.

IV. EVALUATION

In this section, the ANOVA statistical tool is applied to determine whether the influence of a change in the parameter values (factors) on the obtained error and/or size (responses) is significant, to establish the most suitable value for these parameters (to obtain good solutions as fast as possible), and to decide whether the element operation (genetic operator, selector, etc.) is as expected. In most of the cases, the goal is to obtain the smallest generalization error, while keeping the size as small as possible.

In previous studies, G-Prop was applied to solve different problems [1]–[3]. In these studies, parameter values were obtained heuristically; after many simulations, a set of values called "by default" was obtained (same genetic operator application priority, 100 generations, 100 training epochs, etc.; see [1]–[3] for details). In this paper, our aim is to find the "best" set of parameters using robust statistical tools.

By these means, we seek to determine an adequate value for the following EA parameters.

• Number of EA generations: the number of evaluation–reproduction–replacement cycles used to search for new solutions. As more generations are used, a better solution is more likely to be reached. However, computation time increases, and so a compromise is necessary.

• Population size: although a large population implies greater diversity and a higher probability of finding a good solution, the candidate solution evaluation phase is time consuming, and so a small population is desirable (providing a good solution).

• Initialization weight range: researchers have proposed different values for this parameter [53]–[58], and so it is desirable to establish a suitable range. Depending on the initial point of the search space, determined by the set of weights, better or worse solutions are obtained [53].

• Selection rate: this parameter establishes the number of individuals the genetic operators are applied to, in order to generate the offspring for the succeeding generation. This parameter is important, because it establishes the number of new individuals generated each generation and, hence, the time required to obtain good results.

• Genetic operators application priority: the genetic operators must be evaluated to verify that they work as expected. Thus, an application priority must be determined for each operator and also, for the mutation operator, the mutation


probability, the weight mutation range, and the learning constant mutation range.

• Number of training epochs: when evaluating an MLP, a copy of the individual in the population is obtained, so that after training (for this number of epochs) the trained weights are not saved back to the population. Once trained, the fitness function value is obtained using the validation set. The more epochs used, the better the results obtained; however, since the individual (MLP) evaluation is the most expensive (time-consuming) phase and depends directly on the number of epochs, it is necessary to establish a suitable value to minimize computation time while keeping errors low.

In order to evaluate the above parameters related to the population and the genetic operators, we used the pattern classification problem called "Glass." This problem, proposed by Prechelt in [59] and used by Grönroos in [60], consists of classifying different types of glass. The results of a chemical analysis of glass splinters (percentage content of eight different elements), together with an index of refraction, are used to classify the sample as either float-processed or nonfloat-processed building windows, vehicle windows, containers, tableware, or head lamps. This task was motivated by forensic needs in criminal investigation. The dataset was created based on the glass problem dataset from the UCI repository of machine learning databases, and contains 214 instances. Each sample has nine attributes (refractive index, sodium, magnesium, aluminum, silicon, potassium, calcium, barium, and iron) plus the class attribute (type of glass). The Glass1a set was used for the present experiments because this is the set (among the three proposed by Prechelt in [59]) used by Grönroos in [60], which is later used for comparison.

A. Experimental Setup

This section describes in detail the set of tests carried out to obtain the necessary data to apply the ANOVA method and, thus, to determine the most suitable parameter values.

We intend to determine if the influence of a change in a parameter value is significant in the error and/or ANN size obtained. We also wish to establish the most suitable parameter values, to obtain good solutions as fast as possible, and to determine whether the elements (genetic operator, fitness function, etc.) work as expected.

The number of parameters to evaluate is high: three related to the initial population, five to determine the number of individuals generated using each operator, and four to define the behavior of the genetic operators.

Thus, to avoid making a high number of simulations (each parameter is evaluated at three or four different levels), the ANOVA method is applied grouping the parameters as follows.

• A set of simulations to determine the effect of the genetic operator application priorities. Table I shows the different levels used to evaluate these parameters. After making the 1024 simulations (five operators with four levels each) that represent all the possible combinations, ANOVA will be applied to determine the effect of the genetic operator application priorities on the error and network size.

TABLE I: PARAMETERS (AND THE ABBREVIATIONS USED TO REFERENCE THEM LATER) THAT DETERMINE THE GENETIC OPERATOR APPLICATION PRIORITIES, AND VALUES USED TO APPLY ANOVA

TABLE II: PARAMETERS (AND THE ABBREVIATIONS USED TO REFERENCE THEM LATER) THAT DETERMINE THE MUTATION OPERATOR OPERATION, AND VALUES USED TO APPLY ANOVA

ANOM is then used to determine the most suitable parameter value.

• The mutation operator takes as parameters the mutation rate (mutation probability), which indicates the percentage of weights that will be mutated (see Section III-C1); the range of generation of the random numbers that will be added to the weight values to mutate them (weight mutation range); and the range of generation of the random numbers that will be added to the learning constant value to mutate it (learning constant mutation range). Therefore, a set of simulations (once the most suitable values for the operator application priorities have been established) will be carried out to determine the most suitable values of the parameters that control how the mutation operator works. Table II shows the different levels used to evaluate these parameters: the value 0.005 for the weight mutation range (for example) means the random numbers added to the weight values to mutate them are in the interval [−0.005, 0.005]. As in the first experiment, ANOVA is used to determine whether the effect of these parameters on the error and size obtained is statistically significant, and the most suitable value is determined using the ANOM tool.

• Once the values for the optimal operation of the genetic operators are established, a set of simulations to evaluate the parameters related to the initial population (initial population size, initial weight generation range, and selection rate) will be carried out. Table III shows the different levels used to evaluate these parameters: the value 0.05 for the weight generation range (for example) means initial weights are in the interval [−0.05, 0.05]. As in previous cases, ANOVA is used to determine whether the effect of these parameters on the error and size obtained is statistically significant, while the most suitable value is determined using the ANOM tool.


TABLE III: PARAMETERS (AND THE ABBREVIATIONS USED TO REFERENCE THEM LATER) RELATED TO THE INITIAL POPULATION, AND VALUES USED TO APPLY ANOVA

TABLE IV: LEVELS USED TO APPLY ANOVA TO STUDY THE NUMBER OF GENERATIONS AND THE NUMBER OF TRAINING EPOCHS (AND THE ABBREVIATIONS USED TO REFERENCE THEM LATER)

• Finally, once the values for all the above parameters have been obtained, the number of generations of the EA and the number of training epochs for the evaluation of each MLP when fitness is calculated are evaluated. Table IV shows the proposed levels to evaluate these parameters. As in previous cases, ANOVA is used to determine whether the effect of these parameters on the error and size obtained is statistically significant; the most suitable value is determined using the ANOM tool.

B. Statistical Study Using ANOVA for the EA Parameters

The results obtained using ANOVA for each of the above experiments are considered to determine their importance and the most suitable parameter values. The figures shown in the tables are those given by the Minitab [61] program (used to apply the ANOVA and ANOM methods). In order to show results as obtained using the statistical tools, the errors and standard deviations do not appear in the tables in the standard form.

The ANOVA table shows, for each factor, the number of DF, the sum of squares (Sum of Squares), the mean of the squares (Mean Square), the value of the F statistic, and its significance level (Sig. Level); if the latter is smaller than 0.05, then the factor effect is statistically significant at the level of confidence of 95%. These significant factors and their significance levels are highlighted in boldface in the tables.

1) Evaluation of Genetic Operator Application Priorities: The first experiment consisted of determining the importance of the genetic operator application priorities (which act as factors to apply ANOVA). Thus, 1024 simulations were carried out, corresponding to all the possible combinations of application of the four levels for each of the five operators (see Table I), to obtain the error and network size for each combination.
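The 1024 runs correspond to a full factorial design over the five priorities; a sketch of enumerating such a design is shown below (the level values are placeholders, not the actual levels of Table I):

from itertools import product

# Full factorial design: every combination of four levels for each of the five
# operator application priorities, 4**5 = 1024 combinations in total.
operators = ["mutation", "crossover", "addition", "elimination", "training"]
levels = [0.0, 1.0, 2.0, 3.0]  # hypothetical priority levels
design = [dict(zip(operators, combo)) for combo in product(levels, repeat=len(operators))]
print(len(design))  # 1024; each combination is then run to record error and size (the responses)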

TABLE V: ANOVA TABLE FOR THE ERROR (RESPONSE) WITH THE GENETIC OPERATOR APPLICATION PRIORITIES AS FACTORS. THE RESULTS WITH A SIGNIFICANCE LEVEL OVER 95% ARE HIGHLIGHTED IN BOLDFACE

TABLE VI: ANOVA TABLE FOR THE SIZE (RESPONSE) WITH THE GENETIC OPERATOR APPLICATION PRIORITIES AS FACTORS

The ANOVA results, for the error and size, are given in Tables V and VI.

The ANOVA analysis shows how the training operator application priority (trn) influences both the error and the network size obtained, which indicates that changes in this parameter influence the results significantly. This relation is not so clear for the other parameters, although according to the ANOM Tables VII–XI (where the errors are shown as the normalized mean squared error (NMSE) for each level, and the sizes are measured as the number of weights of the network, or number of parameters), different levels of application are associated with higher or lower errors and sizes.

As can be seen in the ANOM tables (see also their effects graphs and Fig. 1), the greatest effects are those of the training operator priority, whereas the eliminating hidden neurons operator effects are the smallest. Moreover, as the mutation operator application priority increases, the average error and network size decrease. As the crossover operator application priority increases, the average error and size also increase, which suggests that this operator does not have beneficial effects on the evolution, but rather acts as a macromutation operator, causing great changes (changing the search-space area, see Section III-C2). This confirms results presented by other researchers [62], [63].

As far as size is concerned, the elimination operator works as expected, obtaining lower sizes when its application priority increases (obtaining, at the same time, worse error results, since the elimination of hidden neurons can lead to network degradation). Moreover, the average size increases as the application priority of the adding neurons operator increases. At the same time, the error obtained is smaller (only with a suitable number of hidden layer neurons will the MLP learn correctly [46], [47]).

The intensive use of the training operator (applying QP as a genetic operator to carry out local search) increases the average error and network size, due to the loss of diversity associated with a high application level, which has a negative effect on the behavior of the EA. This loss of diversity occurs because improvements introduced by the local search impede evolution, overriding size changes introduced by other operators with training-operator changes.

TABLE VII: ANOM TABLE (TOP) FOR THE DIFFERENT LEVELS OF THE MUTATION OPERATOR APPLICATION PRIORITY, AND ITS EFFECTS ON THE ERROR (BOTTOM-LEFT) AND SIZE (BOTTOM-RIGHT)

TABLE VIII: ANOM TABLE (TOP) FOR THE DIFFERENT LEVELS OF THE CROSSOVER OPERATOR APPLICATION PRIORITY, AND ITS EFFECTS ON THE ERROR (BOTTOM-LEFT) AND SIZE (BOTTOM-RIGHT)

Fig. 1(a) and 1(b) shows the effects of the genetic operators application priorities on (a) the error and (b) size. A common scale is used for these graphs in order to highlight influences that could not be shown in Tables VII–XI. Taking into account these graphs and how the method performs when an operator is used with priority 0.0, the mutation and addition could be termed the most important operators, because when they are used with low priority the error increases, but when the other operators are

used with low priority the error is not so great. In this sense, the training and crossover operators are the least important.

2) Evaluation of the Mutation Operator Application Parameters: The second experiment consisted of determining the mutation operator application parameter effects (the mutation probability, weight mutation range, and learning constant mutation range act as factors to apply ANOVA). Thus, 64 simulations were carried out, corresponding to all the possible combinations of application of the four levels for each of the three parameters (Table II), obtaining the error and network size for each combination.


TABLE IX: ANOM TABLE (TOP) FOR THE DIFFERENT LEVELS OF THE ADDITION OPERATOR APPLICATION PRIORITY, AND ITS EFFECTS ON THE ERROR (BOTTOM-LEFT) AND SIZE (BOTTOM-RIGHT)

TABLE X: ANOM TABLE (TOP) FOR THE DIFFERENT LEVELS OF THE ELIMINATION OPERATOR APPLICATION PRIORITY, AND ITS EFFECTS ON THE ERROR (BOTTOM-LEFT) AND SIZE (BOTTOM-RIGHT)

ANOVA results, for the error and size, are given in Tables XII and XIII.

ANOVA analysis shows that the learning constant mutation range has a greater effect on the error, which indicates that changes in this parameter significantly influence the results. It is not possible to affirm that the other parameters affect the responses.

Tables XIV–XVI show the ANOM results (error obtained as the NMSE and the size measured in number of weights or parameters, together with the effects graphs) for these three parameters.

Fig. 2(a) and 2(b) shows the effects of the mutation operator application parameters on (a) the error and (b) size. A common scale is used for these graphs in order to highlight influences that could not be shown in Tables XIV–XVI.

Tables XIV–XVI and Fig. 2 show that learning constant mutation range parameter effects are higher than those of the other parameters, while mutation probability effects are the least evident. As the mutation probability increases, so does the error obtained, due to the fact that large changes may lead to network disruption. Furthermore, a small mutation probability provokes a large error, because the operator does not make the necessary changes to improve the set of weights.

As far as size is concerned, no parameter has a decisive effect on the network size obtained.

The most suitable mutation range is the medium level: neither small values (which would lead to low diversity), nor high values (which would cause weight saturation when training with backpropagation).

TABLE XI: ANOM TABLE (TOP) FOR THE DIFFERENT LEVELS OF THE TRAINING OPERATOR APPLICATION PRIORITY, AND ITS EFFECTS ON THE ERROR (BOTTOM-LEFT) AND SIZE (BOTTOM-RIGHT)

Fig. 1. ANOM analysis of effects of the genetic operators application priorities on (a) the error and (b) size.

3) Evaluation of the Initial Population Parameters: We determined the importance of the initial population parameters: population size, initial weight range, and selection rate (percentage of individuals replaced in each generation). Each of these parameters is a factor when ANOVA is applied. Thus, 48 simulations were carried out, corresponding to all the possible combinations of the three parameters (Table III), to obtain the error and network size for each combination.

TABLE XII: ANOVA TABLE FOR THE ERROR (RESPONSE) WITH THE MUTATION OPERATOR APPLICATION PARAMETERS AS FACTORS

TABLE XIII: ANOVA TABLE FOR THE SIZE (RESPONSE) WITH THE MUTATION OPERATOR APPLICATION PARAMETERS AS FACTORS

ANOVA results, for the error and size, are given in Tables XVII and XVIII.

ANOVA analysis shows that a higher initial weight range and selection rate increase the error obtained. Thus, changes in these parameters significantly affect the results. Such a relation was not found for the population size, although according to the ANOM table (Table XIX), there is a clear relation between the population size, the obtained error, and the network size.

The ANOM results for the population parameters (obtained error as NMSE and the size measured in number of weights or parameters) are shown in Tables XIX–XXI (and their effects graphs).

Fig. 3(a) and 3(b) shows the effects of the population parameters on (a) the error and (b) size. A common scale is used for these graphs to highlight influences that could not be shown in Tables XIX–XXI.

Tables XIX–XXI and Fig. 3 show that the initial weight range and selection rate effects are greater than those of the population size. Moreover, as the population size increases, the average error and network size decrease. This is due to the fact that the higher the population size, the higher the probability of finding good solutions.

Nevertheless, if the selection rate increases too much, the creation (and replacement) of many new individuals in each generation of the EA will introduce too much diversity into the population, and then both the error and the network size increase (the EA becomes a random search algorithm).

The most suitable initial weight range, according to the analysis, is an interval close to [−0.05, 0.05], which would be in agreement with the results presented by Fernández and Hernández in [57] and [58]. Furthermore, this is in accordance with the low value obtained for the weight mutation range: initial weights are small in the first generation (the initial weight range is small) and they are changed smoothly by the EA.

4) Evaluation of the Number of Generations and Number of Training Epochs: A further objective of this study was to determine the effects of the number of generations and the number of training epochs (which act as factors when fitness is calculated) on the error and network size (responses). Both parameters directly affect computation time, increasing it as their values increase, and so it is advisable to use the lowest possible level for both parameters, compatible with obtaining good results. Sixteen simulations were carried out, corresponding to all the possible combinations of application of the four levels for each of these parameters (Table IV), to obtain the error and network size for each combination.

The ANOVA results for the error and size are shown in Tables XXII and XXIII.

ANOVA analysis shows that both parameters significantly affect the error obtained, although the effect of the number of training epochs (significant at the confidence level of 99%) is greater than that of the number of generations (significant at the confidence level of 90%). Nevertheless, neither parameter produces a statistically significant effect on network size.

The ANOM results (obtained error as NMSE and the size measured in number of weights or parameters) for the number of generations and the number of training epochs are shown in Tables XXIV and XXV.

Fig. 4(a) and 4(b) shows the effects of the number of generations and training epochs on (a) the error and (b) size. A common scale is used for these graphs in order to highlight influences that could not be shown in Tables XXIV and XXV.

As expected (Tables XXIV and XXV and Fig. 4), increasing the value of both parameters reduces average errors: as the number of generations increases, good solutions are more likely to be found; and the higher the number of training epochs to calculate the fitness, the better the results obtained. The variation obtained by changing the number of training epochs is more significant, and so the number of training epochs should be increased for optimal results.

Nevertheless, the effect of these parameters on network size is not so clear, although as the number of generations grows, the number of times the addition operator is applied increases, and the average network size will grow. It is advisable to apply the addition operator with a higher application priority than the elimination operator.

5) Conclusions: With regard to the experiments that determine the genetic operator application priorities, ANOVA analysis shows how the training operator application priority (QP) and the learning constant mutation range have the greatest effect on the error and size obtained. Application of the ANOM tool shows that the higher the mutation operator application priority, the lower the average error and size obtained. Moreover, the higher the crossover operator application priority, the higher the average error and size obtained (this operator acts as a macromutation operator). With respect to the mutation operator, as the mutation probability increases, the obtained error grows, because excessively large changes may lead to disruption of the neural network, while changes which are too small can lead to modifications that do not improve the weight set. The adding and eliminating operators work as expected, increasing or decreasing the average network size as their application priority is increased or reduced.

TABLE XIV: ANOM TABLE (TOP) FOR THE DIFFERENT LEVELS OF THE MUTATION PROBABILITY, AND ITS EFFECTS ON THE ERROR (BOTTOM-LEFT) AND SIZE (BOTTOM-RIGHT)

TABLE XV: ANOM TABLE (TOP) FOR THE DIFFERENT LEVELS OF THE WEIGHT MUTATION RANGE, AND ITS EFFECTS ON THE ERROR (BOTTOM-LEFT) AND SIZE (BOTTOM-RIGHT)

ANOVA analysis of the experimental data concerning the initial population parameters shows that the initial weight range and the selection rate have a statistically significant effect on the error obtained. The ANOM study shows that as the population size grows, both the error and network size decrease.

For the third set of experimental results (studying the number of generations and number of training epochs), the ANOVA analysis shows that both parameters significantly affect the error obtained (the higher the value of these parameters, the lower the error obtained).

As the ANOVA analysis shows, parameters related to the training operator application (priority and number of training epochs), or to the learning constant changes, cause modifications that are statistically significant, whereas this cannot be said for the other parameters (which are not statistically significant even at a low level of confidence).

The fact that parameters which control the MLP without acting on the EA significantly affect the outcome of the algorithm indicates that these parameters are even more important in the hybrid method (contrary to the opinion of some researchers, who have proposed using only genetic algorithms (GAs) to train MLPs, without using gradient information [38], [64]–[67]). This consideration justifies our target of trying to optimize all the parameters of the MLP with the EA, thus avoiding the need to determine any one of them by hand, since these have been shown to be the most important (they produce large variations in the results when small changes in the parameters are made).

TABLE XVI: ANOM TABLE (TOP) FOR THE DIFFERENT LEVELS OF THE LEARNING CONSTANT MUTATION RANGE, AND ITS EFFECTS ON THE ERROR (BOTTOM-LEFT) AND SIZE (BOTTOM-RIGHT)

Fig. 2. ANOM analysis of effects of the mutation operator application parameters on (a) the error and (b) size.


TABLE XVII: ANOVA TABLE FOR THE ERROR (RESPONSE) WITH THE POPULATION PARAMETERS AS FACTORS

TABLE XVIII: ANOVA TABLE FOR THE SIZE (RESPONSE) WITH THE POPULATION PARAMETERS AS FACTORS

Even if the quantitative results obtained do not hold for every problem or every ANN, the qualitative results will probably be the same: the parameters related to the ANN have the greatest influence on the error, and the influence on the error determined by ANOM for each parameter will increase or decrease in roughly the same way.

C. Comparison of Results Obtained With Those Presented by Other Authors

The results of the above analysis are shown in Table XXVI. Of course, depending on the difficulty of the problem, the number of generations and the population size needed for greater diversity should be higher or lower.

In order to determine whether the parameter values obtained are the most suitable, a set of 50 simulations was performed to solve the Glass1a classification problem.

The t-Student statistical tests were used to check the validity of the results obtained (average and standard deviation), and to test whether differences among them were significant.

Table XXVII shows the evaluation results of G-Prop (using the proposed parameter values) for the Glass1a problem and compares them with those obtained using QP, those presented by Prechelt [59], and those presented by Grönroos [60]. This table shows the average classification error on the test set, the network size (measured as the number of weights or parameters), the error and size corresponding to the individual that obtained the smallest error, and the network size of the smallest individual (and its error).
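As an illustration of how the network size is counted when it is measured as the number of weights or parameters, the following sketch assumes a single-hidden-layer MLP with bias terms; the 9-input, 6-output dimensions are those of the glass data and are used here only for illustration.

def mlp_num_parameters(n_in: int, n_hidden: int, n_out: int) -> int:
    # Weights plus biases of an n_in -> n_hidden -> n_out perceptron.
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

# Under this convention, an average size of about 182 parameters corresponds
# to roughly 11 hidden units: (9 + 1) * 11 + (11 + 1) * 6 = 182.
print(mlp_num_parameters(9, 11, 6))  # 182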

The results presented in Table XXVII corresponding to the error and network size were verified using Student's t-tests; significant differences were found at the 80% confidence level. This significance level is not the usual one (a confidence level of 95% or 99% is usually reported) because of the high variance of the results obtained by Prechelt and Grönroos. Nevertheless, the differences between the G-Prop configurations, and those in network size, are significant at the 99% confidence level.
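Such a comparison can be reproduced from summary statistics alone. The following is a minimal sketch of a two-sample Student's t-test computed from the mean, standard deviation, and number of simulations (50 here); the figures are hypothetical placeholders rather than the values of Table XXVII.

from scipy import stats

mean_a, std_a, n_a = 31.2, 2.1, 50   # e.g., G-Prop with the proposed parameters
mean_b, std_b, n_b = 33.0, 5.4, 50   # e.g., a method with a high variance

# Two-sample t-test from summary statistics (Welch's variant, since the
# variances clearly differ).
t_stat, p_value = stats.ttest_ind_from_stats(mean_a, std_a, n_a,
                                              mean_b, std_b, n_b,
                                              equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# The difference is significant at the 80% confidence level when p < 0.20,
# and at the 95% or 99% level when p < 0.05 or p < 0.01; a large standard
# deviation in one sample pushes p upward, which is why only the 80% level
# is reached against the high-variance results.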

Table XXVII shows that the G-Prop method is competitive with the other methods in terms of classification error. Grönroos [60], with a GA that used Kitano codification to evolve the ANN, obtained a classification ability of 32 ± 5; Prechelt [59], using RPROP, obtained 33 ± 5; G-Prop obtains a similar or better classification error.

For network size, G-Prop obtains better results (182 ± 33) than the methods of the other authors (350 for the methods used by Prechelt [59] and Grönroos [60]).

The G-Prop configuration that uses the parameter values established in this study obtains better results in classification ability and network size, which demonstrates the validity of the analysis.

The best classification error was obtained with the smallest size, with no tradeoff between the two, unlike previous parameter settings (where the best error, 28.3019, does not match the best size, 150). Moreover, the results of this parameter setting are evident in the much smaller size of the resulting ANN, with a new average of 182, compared to the previous best value of 214.

V. APPLICATION OF G-PROP TO REAL-WORLD PROBLEMS

In Section IV, the G-Prop method was statistically analyzed with regard to its ability to design MLPs on a benchmark problem.

In this section, the method is applied to a real-world pattern classification problem widely used in the literature, to verify that the values obtained in the above analysis are suitable for solving other problems (even if they are not the optimal set for such problems). Another significant consideration is the cost: due to the application cost of ANOVA (computation time and subsequent analysis), it is better to study a particular problem than to attempt to apply it to all possible problems. Furthermore, due to the "no free lunch" theorem [68], there is probably no such thing as a single parameter setting that is ideal for all problems.

The tests used to assess the accuracy (error obtained) of a method must be chosen carefully, because some (synthetic problems, also called "toy problems") are not suitable for testing certain capacities of the BP algorithm, such as generalization [69]. Our opinion, shared with Prechelt [59], is that at least two real-world problems should be used to test an algorithm.

In these problems, the division between classes is not as clear as it is in synthetic problems. The dispersion of samples within a single class is also greater, due to noise [70]. In any case, the application of an algorithm to real-world problems is the way to test its abilities and limitations.

For this reason, we used a pattern classification problem that deals with the analysis of biomedical images (DNA Helicases), proposed and initially studied by Marabini et al. [71] and Fernández et al. [72]. This problem has been used in various studies to test the classification ability of different methods [71]–[74].

DNA Helicases are enzymes used in the replication, repair, and recombination of DNA. All the species in Nature probably contain a set of helicases, which are employed in most DNA metabolism processes. In spite of their important role, little is known about their structure.

The goal of the classification is to study the structure of a representative hexameric helicase: the large T antigen of Simian Virus 40 [71], [72].


TABLE XIX
ANOM TABLE (TOP) FOR THE DIFFERENT LEVELS OF THE POPULATION SIZE, AND ITS EFFECTS ON THE ERROR (BOTTOM-LEFT) AND SIZE (BOTTOM-RIGHT)

TABLE XX
ANOM TABLE (TOP) FOR THE DIFFERENT LEVELS OF THE INITIAL WEIGHT RANGE, AND ITS EFFECTS ON THE ERROR (BOTTOM-LEFT) AND SIZE (BOTTOM-RIGHT)

When observed in the electron microscope, large T antigen preparations mainly show either a characteristic, roughly circular view with a stain-penetrating region in the center (the top view) or a view with a rectangular shape (the side view); see Fig. 5.

To assess the ability of the method, 1817 examples were obtained from microscope images of the T antigen of Simian Virus 40. This initial set was classified by hand, obtaining 741 examples of type 0 (side view) and 1076 of type 1 (top view).

Each example is a microscope image of 50 × 50 pixels with eight levels of gray. Each particle is therefore represented by a vector of 2500 components, too large for a neural network, so the data are preprocessed. The simplest way to do this is to average the image over 10 × 10 pixel blocks, reducing it to a 5 × 5 grid of block averages. Each example thus consists of 25 inputs (the 25 blocks of the image to be classified) and one output, where zero denotes a side view and one a top view.
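The block-averaging step can be sketched as follows; this is only an illustration assuming 10 × 10 pixel blocks of a 50 × 50 image, and the random image below stands in for a real micrograph.

import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 8, size=(50, 50)).astype(float)  # eight gray levels

# Reshape the 50 x 50 image into a 5 x 5 grid of 10 x 10 blocks and average
# each block, producing the 25 inputs fed to the MLP.
block = 10
inputs = image.reshape(5, block, 5, block).mean(axis=(1, 3)).ravel()
print(inputs.shape)  # (25,)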

This training set has been used in the papers cited at the beginning of this section to classify images of the T antigen of Simian Virus 40. The set consists of 197 examples, divided into a training set (78 examples), a validation set (60), and a test set (59).

To apply the method to this problem, 50 simulations were carried out; the validity of the results presented (average and standard deviation) and the significance of the differences between them were verified by means of Student's t-tests.

Table XXVIII compares the evaluation results of several configurations of the proposed method on the DNA Helicases problem with the results obtained using QP, those presented by Castillo et al. using simulated annealing [74], and those presented by Merelo et al. using the G-LVQ method [73].


TABLE XXI
ANOM TABLE (TOP) FOR THE DIFFERENT LEVELS OF THE SELECTION RATE, AND ITS EFFECTS ON THE ERROR (BOTTOM-LEFT) AND SIZE (BOTTOM-RIGHT)


Fig. 3. ANOM analysis of effects of the population parameters on (a) the error and (b) size.

G-LVQ is a genetic optimization of the learning vector quantization (LVQ) method proposed by Kohonen [75]. This table shows the average error on the test set, the average network size measured as the number of weights or parameters, the error and the size corresponding to the individual with the smallest error over all simulations, and the size of the smallest individual (and its error).


TABLE XXII
ANOVA TABLE FOR THE ERROR WITH THE NUMBER OF GENERATIONS AND NUMBER OF TRAINING EPOCHS AS FACTORS

TABLE XXIII
ANOVA TABLE FOR THE SIZE WITH THE NUMBER OF GENERATIONS AND NUMBER OF TRAINING EPOCHS AS FACTORS

TABLE XXIV
ANOM TABLE (TOP) FOR THE DIFFERENT LEVELS OF THE NUMBER OF GENERATIONS, AND THEIR EFFECTS ON THE ERROR (BOTTOM-LEFT) AND SIZE (BOTTOM-RIGHT)

The results (error and network size) obtained using G-Prop, as well as those presented by Castillo et al. [74] and by Merelo et al. [73], were verified using Student's t-tests; the differences between them are significant at the 95% and 99% levels, except for the comparison between SA-Prop and G-Prop, where the differences are significant at the 90% level.

Table XXVIII shows that G-Prop, in its different configurations, obtains better results (classification error) than the other methods: using G-LVQ [73] the best result is ; SA-Prop [74] obtains a classification error of ; whereas any configuration of G-Prop obtains a smaller error.

As far as the network size is concerned, G-Prop obtains bigger networks ( parameters) than those obtained with G-LVQ (57 parameters). As in the previous test, the average size in this case is better than the best size in G-Prop; moreover, the error and the size are optimized at the same time.

The configuration that uses the parameter value set obtained in this study obtains better results in classification ability as well as in network size. This demonstrates the validity of the analysis made. It also shows how a careful analysis of parameter values can lead to significant improvements in the quantities optimized (error and network size) and to the elimination of the tradeoff between the two.

VI. DISCUSSION AND CONCLUSION

This paper presents an analysis of the parameters of G-Prop, an algorithm to train MLPs based on an EA and BP, which automatically establishes the values of the parameters of the MLP (initial weights, hidden layer size, and learning rate), thus avoiding the task of setting these parameters by hand.

This method evolves the learning rate, the initial weights, and the hidden layer size by applying specific genetic operators. These operators modify the MLP by changing its size (number of hidden-layer neurons), certain weight values, or the learning rate, by combining two networks, or by tuning an MLP by means of local search (QP).
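The operators are applied according to application priorities. The following is only a hedged sketch of one way such priority-weighted operator selection could be implemented (each operator drawn with probability proportional to its priority); the priorities and the exact scheduling used by G-Prop are not reproduced here.

import random

# Hypothetical priorities: a higher value means the operator is applied
# more often during the evolution.
operator_priorities = {
    "mutation": 5,
    "crossover": 2,
    "add_neuron": 1,
    "remove_neuron": 1,
    "qp_training": 4,   # local search with quickprop
}

def pick_operator(priorities):
    # Roulette-wheel choice proportional to the application priorities.
    ops, weights = zip(*priorities.items())
    return random.choices(ops, weights=weights, k=1)[0]

print(pick_operator(operator_priorities))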

A rigorous, detailed study of each part of the hybrid method has been carried out, establishing the most suitable value for each parameter using statistical tools such as ANOVA, ANOM, and Student's t-tests. The influence of the parameter values on the errors and sizes obtained was determined, and the most suitable value of each parameter for achieving good solutions was obtained and tested. It was also verified that the genetic operators worked as expected.

Due to the high number of parameters involved in the evaluation, and to avoid the execution of a high number of simulations (each parameter was evaluated with three or four values), ANOVA was applied in four steps, as listed below.


TABLE XXV
ANOM TABLE (TOP) FOR THE DIFFERENT LEVELS OF THE NUMBER OF TRAINING EPOCHS, AND THEIR EFFECTS ON THE ERROR (BOTTOM-LEFT) AND SIZE (BOTTOM-RIGHT)


Fig. 4. ANOM analysis of effects of the number of generations and training epochs on (a) the error and (b) size.

• first, ANOVA was applied to a set of simulations to evaluate the effect of the application priorities of the genetic operators;


• second, it was applied to a set of simulations to determine the most suitable values for the mutation operator parameters;

• subsequently, the parameters related to the initial population (population size, generation range of the initial weights, and selection rate) were evaluated;

• finally, the number of generations of the EA and the number of training epochs for the evaluation of each MLP (when calculating its fitness) were evaluated.

TABLE XXVI
MOST SUITABLE PARAMETER VALUES FOUND

TABLE XXVII
EVALUATION OF THE PROPOSED METHOD WITH THE GLASS1A BENCHMARK, COMPARED WITH RESULTS OBTAINED USING QP AND THOSE PRESENTED BY PRECHELT USING RPROP AND GRÖNROOS USING A GA WITH KITANO CODIFICATION. THE G-PROP CONFIGURATION MARKED WITH CORRESPONDS TO SIMULATIONS IN WHICH THE DEFAULT PARAMETERS WERE USED

For each set of experiments, ANOVA was applied to determine whether the effect of these parameters on the obtained error and size was statistically significant, and the ANOM tool was used to obtain the most suitable value.
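As a rough sketch of that ANOM step (not the MINITAB tool actually used), one can compute the mean error at each level of a factor and take the level with the lowest mean as the most suitable value; the full ANOM procedure also plots decision limits around the grand mean, which are omitted here, and the run data below are hypothetical.

import numpy as np

# Hypothetical test-set errors for three levels of a factor (population size).
levels = {
    100: np.array([33.8, 34.2, 33.5, 34.0]),
    200: np.array([32.6, 32.9, 32.3, 32.7]),
    400: np.array([31.9, 32.1, 31.6, 32.0]),
}

level_means = {lvl: errs.mean() for lvl, errs in levels.items()}
grand_mean = np.concatenate(list(levels.values())).mean()
best_level = min(level_means, key=level_means.get)

print(level_means)
print(f"grand mean = {grand_mean:.2f}, most suitable level = {best_level}")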

Taking into account the set of experiments and the ANOVA statistical analysis, it was found that the parameters that act directly on the MLP (the training operator application priority and the learning constant mutation range) have greater effects on the error and size obtained. Moreover, two parameters related to the initial population, the initial weight generation range and the selection percentage, influence the results obtained. As expected, the selection percentage and the number of training epochs were statistically significant. This is especially important, since these parameters determine the computation time required.


Fig. 5. Two representative images of the training set, corresponding to (a) a top view and (b) a side view.

TABLE XXVIII
COMPARISON OF THE RESULTS OBTAINED BY EVALUATION OF THE PROPOSED METHOD WITH THE DNA HELICASES BENCHMARK, THOSE OBTAINED USING QP, THOSE PRESENTED BY CASTILLO et al. USING SIMULATED ANNEALING, AND BY MERELO et al. USING THE G-LVQ METHOD. THE G-PROP CONFIGURATION MARKED WITH CORRESPONDS TO SIMULATIONS IN WHICH THE DEFAULT PARAMETERS WERE USED

Although these were the parameters that most significantly affected the results, after the application of the ANOM tool and the observation of the main-effects plots, it was apparent that a higher application rate of the mutation operator led to better results concerning both the error and the size; the crossover operator, on the other hand, acts as a macromutation. The addition and elimination operators acted as expected, increasing or reducing the final average size as their application rate was increased.


The fact that parameters directly related to the MLP, and not to the EA, have significant effects indicates that such parameters are even more important in the hybrid method (contrary to the claims of some researchers, who use only a GA to train MLPs). This justifies our aim of optimizing all the parameters of the MLP by means of the EA, avoiding the task of setting them by hand. Such parameters are the most important, and small changes in them may produce large variations in the results.

The algorithm is robust: it is not oversensitive to the values of most parameters, and it works well with the default values. Moreover, the values shown in Table XXVI are suitable for solving other problems, except those concerning the population size, the number of generations, and the number of training epochs used to obtain the fitness. In these cases, a complex problem needs higher values, whereas easy problems can be solved with lower values (requiring less computation time).

This analysis could help other authors to evaluate their methods (whether hybrid or not) and even to determine the most suitable values for the majority of the parameters these methods need. It would be of interest to carry out a similar analysis of other hybrid algorithms (for example, the optimization of RBF networks by means of EAs [76], or hybrid fuzzy-evolutionary algorithms) to determine whether the conclusions obtained here are of general application. This paper proposes a methodology so that other authors may rigorously analyze their methods and the results obtained in their research.

A further question of interest is that, in cases where evaluation is expensive and running the algorithms takes a long time, ANOVA analysis might be as effective as, and much faster than, a second-level EA for parameter optimization [10].

Finally, although the analysis presented in this paper is based on establishing the values of a set of parameters before the execution of the method, it would be of interest, as future work, to control the values of these parameters during execution [17], [77]–[83]. This would avoid the need to carry out a large number of simulations and a statistical analysis. The use of such control techniques would produce a suitable solution during the execution of the EA, while simultaneously obtaining the best set of parameter values.

ACKNOWLEDGMENT

The authors are grateful to the anonymous referees for their constructive comments and advice on the first revision of this paper.

REFERENCES

[1] P. A. Castillo, J. Carpio, J. J. Merelo, V. Rivas, G. Romero, and A. Prieto, "Evolving multilayer perceptrons," Neural Processing Lett., vol. 12, no. 2, pp. 115–127, Oct. 2000.

[2] P. A. Castillo, J. J. Merelo, V. Rivas, G. Romero, and A. Prieto, "G-Prop: Global optimization of multilayer perceptrons using GA's," Neurocomput., vol. 35, no. 1–4, pp. 149–163, 2000.

[3] P. A. Castillo, M. G. Arenas, J. G. Castellano, M. Cillero, J. J. Merelo, A. Prieto, V. Rivas, and G. Romero, "Function approximation with evolved multilayer perceptrons," in Advances in Neural Networks and Applications, N. E. Mastorakis, Ed. Singapore: World Scientific, Artificial Intelligence Series, 2001, pp. 195–200.

[4] R. Mead, The Design of Experiments: Statistical Principles for Practical Application. Cambridge, U.K.: Cambridge Univ. Press, 1988.

[5] G. Casella and R. L. Berger, Statistical Inference. Belmont, CA: Duxbury, 1990.

[6] D. C. Montgomery, Design and Analysis of Experiments. New York: Wiley, 1984.

[7] L. Davis, Handbook of Genetic Algorithms. New York: Van Nostrand Reinhold, 1991.

[8] I. Jagielska, C. Matthews, and T. Whitfort, "An investigation into the application of neural networks, fuzzy logic, genetic algorithms, and rough sets to automated knowledge acquisition problems," Neurocomput., vol. 24, pp. 37–54, 1999.

[9] D. Kim and C. Kim, "Forecasting time series with genetic fuzzy predictor ensemble," IEEE Trans. Fuzzy Syst., vol. 5, pp. 523–535, Aug. 1997.

[10] J. J. Grefenstette, "Optimization of control parameters for genetic algorithms," IEEE Trans. Syst., Man, Cybern., vol. SMC-16, pp. 122–128, Jan. 1986.

[11] R. A. Fisher, "Theory of statistical estimation," Proc. Cambridge Philosophical Soc., vol. 22, pp. 700–725, 1925.

[12] R. A. Fisher, "The comparison of samples with possibly unequal variances," Ann. Eugenics, vol. 9, pp. 174–180, 1936.

[13] G. W. Snedecor and W. G. Cochran, Statistical Methods, 7th ed. Ames, IA: Iowa State Univ. Press, 1980.

[14] S. Brandt, Statistical and Computational Methods in Data Analysis. Amsterdam, The Netherlands: North Holland, 1983.

[15] K. A. De Jong, "An Analysis of the Behavior of a Class of Genetic Adaptive Systems," Ph.D. dissertation, Univ. Michigan, Ann Arbor, 1975.

[16] R. E. Mercer and J. R. Sampson, "Adaptive search using a reproductive meta-plan," Kybernetes, vol. 7, pp. 215–228, 1978.

[17] T. Bäck, D. Fogel, and Z. Michalewicz, Handbook of Evolutionary Computation. New York: Inst. Phys., 1997.

[18] R. Hinterding, Z. Michalewicz, and A. E. Eiben, "Adaptation in evolutionary computation: A survey," in Proc. 4th IEEE Conf. Evol. Comput., 1997, pp. 65–69.

[19] A. E. Eiben, R. Hinterding, and Z. Michalewicz, "Parameter control in evolutionary algorithms," IEEE Trans. Evol. Comput., vol. 3, pp. 124–141, Mar. 1999.

[20] G. R. Harik and F. G. Lobo, "A parameter-less genetic algorithm," in Proc. Genetic Evol. Comput. Conf., vol. 1, W. Banzhaf, J. Daida, A. E. Eiben, M. H. Garzon, V. Honavar, M. Jakiela, and R. E. Smith, Eds., Orlando, FL, 13–17, 1999, pp. 258–265.

[21] T. Bäck, "Optimal mutation rates in genetic search," in Proc. 5th Int. Conf. Genetic Algorithms, S. Forrest, Ed., 1993, pp. 2–8.

[22] D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley, 1989.

[23] D. E. Goldberg, K. Deb, and D. Thierens, "Toward a better understanding of mixing in genetic algorithms," in Proc. 4th Int. Conf. Genetic Algorithms, R. K. Belew and L. B. Booker, Eds., 1991, pp. 190–195.

[24] D. E. Goldberg, K. Deb, and J. H. Clark, "Genetic algorithms, noise and the sizing of populations," Complex Syst., vol. 6, pp. 333–362, 1992.

[25] G. Harik, E. Cantú-Paz, D. E. Goldberg, and B. L. Miller, "The gambler's ruin problem, genetic algorithms, and the sizing of populations," in Proc. 4th IEEE Conf. Evol. Comput., 1997, pp. 7–12.

[26] J. D. Schaffer and A. Morishima, "An adaptive crossover distribution mechanism for genetic algorithms," in Proc. 2nd Int. Conf. Genetic Algorithms and Their Applications, J. J. Grefenstette, Ed., 1987, pp. 36–40.

[27] D. Thierens and D. E. Goldberg, "Mixing in genetic algorithms," in Proc. 4th Int. Conf. Genetic Algorithms, R. K. Belew and L. B. Booker, Eds., 1991, pp. 31–37.

[28] D. Thierens, "Dimensional analysis of allele-wise mixing revisited," in Proc. 4th Conf. Parallel Problem Solving From Nature, vol. 1141, Lecture Notes in Computer Science, H. M. Voigt, W. Ebeling, I. Rechenberg, and H. P. Schwefel, Eds., 1996, pp. 255–265.

[29] D. Whitley, "The GENITOR algorithm and selection pressure: Why rank-based allocation of reproductive trials is best," in Proc. 3rd Int. Conf. Genetic Algorithms, J. D. Schaffer, Ed., 1989, pp. 116–121.

[30] B. A. Julstrom, "Comparing Darwinian, Baldwinian and Lamarckian search in a genetic algorithm for the 4-cycle problem," in Proc. Congr. Evol. Comput. Genetic Evolutionary Comput. Conf., Orlando, FL, 1999, pp. 134–138.

[31] I. De Falco, A. Iazzetta, P. Natale, and E. Tarantino, "Evolutionary neural networks for nonlinear dynamics modeling," Parallel Problem Solving From Nature 98, vol. 1498, pp. 593–602, 1998.

[32] A. Cangelosi, D. Parisi, and S. Nolfi, "Cell division and migration in a genotype for neural networks," Network: Comput. Neural Syst., vol. 5, pp. 497–515, 1994.

[33] D. Moriarty and R. Miikkulainen, "Hierarchical evolution of neural networks," in Proc. 1998 IEEE Conf. Evol. Comput., Anchorage, AK, 1998.


[34] D. Thierens, J. Suykens, J. Vandewalle, and B. De Moor, "Genetic weight optimization of a feedforward neural network controller," in Proc. Conf. Artificial Neural Nets Genetic Algorithms, 1993, pp. 658–663.

[35] J. J. Merelo, M. Patón, A. Canas, A. Prieto, and F. Morán, "Optimization of a competitive learning neural network by genetic algorithms," in Lecture Notes in Computer Science. New York: Springer-Verlag, 1993, vol. 686, pp. 185–192.

[36] J. J. Merelo, M. Patón, A. Canas, A. Prieto, and F. Morán, "Genetic optimization of a multilayer neural network for cluster classification tasks," Neural Network World, vol. 3, pp. 175–186, 1993.

[37] D. J. Montana and L. Davis, "Training feedforward neural networks using genetic algorithms," in Proc. 11th Int. Joint Conf. Artificial Intell., 1989, pp. 762–767.

[38] W. Kinnebrock, "Accelerating the standard backpropagation method using a genetic approach," Neurocomput., vol. 6, pp. 583–588, 1994.

[39] H. De Garis, "Genetic programming: Building artificial neural systems using genetically programmed neural network modules," in Proc. 7th Int. Conf. Machine Learning, B. W. Porter and R. J. Mooney, Eds., 1990, pp. 132–139.

[40] R. D. Beer and J. C. Gallagher, "Evolving dynamical neural networks for adaptive behavior," Adaptive Behavior, vol. 1, no. 1, pp. 91–122, 1992.

[41] X. Yao and Y. Liu, "Toward designing artificial neural networks by evolution," Appl. Math. Comput., vol. 91, no. 1, pp. 83–90, 1998.

[42] M. Riedmiller, "A direct adaptive method for faster backpropagation learning: The RPROP algorithm," in IEEE Int. Conf. Neural Networks, vol. 1, 1993, pp. 586–591.

[43] R. A. Jacobs, "Increased rates of convergence through learning rate adaption," Neural Networks, vol. 1, no. 4, pp. 295–307, 1988.

[44] F. M. Silva and L. B. Almeida, "Acceleration techniques for the backpropagation algorithm," in Proc. EURASIP Workshop, vol. 412, Lecture Notes in Computer Science, L. B. Almeida and C. J. Wellekens, Eds., New York, 1990, pp. 110–119.

[45] J. B. Lamarck, Philos. Zool., 1809.

[46] I. Bellido and G. Fernandez, "Backpropagation growing networks: Toward local minima elimination," in Proc. 1st Int. Conf. Artificial Neural Networks. New York: Springer-Verlag, 1991, vol. 540, Lecture Notes in Computer Science, pp. 130–135.

[47] G. Bebis, M. Georgiopoulos, and T. Kasparis, "Coupling weight elimination with genetic algorithms to reduce network size and preserve generalization," Neurocomput., vol. 17, pp. 167–194, 1997.

[48] R. D. Reed, "Pruning algorithms—A survey," IEEE Trans. Neural Networks, vol. 4, pp. 740–744, Sept. 1993.

[49] R. D. Reed and R. J. Marks, Neural Smithing. Cambridge, MA: MIT Press/Bradford, 1999.

[50] X. Yao, "Evolving artificial neural networks," Proc. IEEE, vol. 87, pp. 1423–1447, Sept. 1999.

[51] T. Jasic and H. Poh, "Analysis of pruning in backpropagation networks for artificial and real world mapping problems," in Proc. 3rd Int. Conf. Artificial Neural Networks. New York: Springer-Verlag, 1995, vol. 930, Lecture Notes in Computer Science, pp. 239–245.

[52] M. Pelillo and A. Fanelli, "A method of pruning layered feed-forward neural networks," in Lecture Notes in Computer Science. New York: Springer-Verlag, 1993, vol. 686, pp. 278–283.

[53] G. Thimm and E. Fiesler, "Neural network initialization," in Lecture Notes in Computer Science. New York: Springer-Verlag, 1995, vol. 930, pp. 535–542.

[54] J. F. Kolen and J. B. Pollack, "Back Propagation is Sensitive to Initial Conditions," Lab. Artificial Intell. Res., Comput. Inform. Sci. Dept., TR 90-JK-BPSIC, 1990.

[55] Y. Lee, S. H. Oh, and M. W. Kim, "An analysis of premature saturation in back-propagation learning," Neural Networks, vol. 6, pp. 719–728, 1993.

[56] S. E. Fahlman, "Faster-learning variations of back-propagation: An empirical study," in Proc. 1988 Connectionist Models Summer School, D. S. Touretzky, G. Hinton, and T. Sejnowski, Eds., San Mateo, CA, 1988, pp. 38–51.

[57] M. Fernández and C. Hernández, "A comparison among weight initialization methods for multilayer feedforward networks," in Proc. 2000 IEEE Int. Joint Conf. Neural Networks, vol. 4, 2000, pp. 543–548.

[58] M. Fernández and C. Hernández, "Weight initialization methods for multilayer feedforward," presented at the European Symp. Artificial Neural Networks (ESANN'2001), Bruges, Belgium, 2001.

[59] L. Prechelt, "PROBEN1—A Set of Benchmarks and Benchmarking Rules for Neural Network Training Algorithms," Fakultät für Informatik, Universität Karlsruhe, Karlsruhe, Germany, Tech. Rep. 21/94, 1994.

[60] M. A. Grönroos, "Evolutionary Design of Neural Networks," M.Sc. thesis, Dept. Math. Sci., Univ. Turku, Turku, Finland, 1998.

[61] MINITAB Statistical Software, Release 13, Minitab, Inc., 2000. [Online]. Available: http://www.minitab.com

[62] D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing: Explorations in the Microstructures of Cognition. Cambridge, MA: MIT Press, 1986.

[63] M. Land, "Evolutionary Algorithms With Local Search for Combinatorial Optimization," Ph.D. dissertation, Comput. Sci. Eng. Dept., Univ. California-San Diego, La Jolla, 1998.

[64] V. W. Porto, D. B. Fogel, and L. J. Fogel, "Alternative neural network training methods," IEEE Expert, vol. 10, pp. 16–22, 1995.

[65] A. P. Topchy and O. A. Lebedko, "Neural network training by means of cooperative evolutionary search," Nuclear Instrum. Methods Phys. Res., Section A: Accelerators, Spectrometers, Detectors, Assoc. Equipment, vol. 389, no. 1–2, pp. 240–241, 1997.

[66] P. Osmera, "Optimization of neural networks by genetic algorithms," Neural Network World, vol. 5, no. 6, pp. 965–976, 1995.

[67] B. Yoon, D. H. Holmes, and G. Langholz, "Efficient genetic algorithms for training layered feedforward neural networks," Inform. Sci., vol. 76, no. 1–2, pp. 67–85, 1994.

[68] D. H. Wolpert and W. G. Macready, "No free lunch theorems for optimization," IEEE Trans. Evol. Comput., vol. 1, pp. 67–82, Jan. 1997.

[69] S. E. Fahlman, "An Empirical Study of Learning Speed in Back-Propagation Networks," Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep., 1988.

[70] J. J. Merelo, "Estudio, Aplicaciones y Optimización Mediante Algoritmos Genéticos de Algoritmos Neuronales de Cuantización Vectorial," Doctoral dissertation, Departamento de Electrónica y Tecnología de Computadores, Universidad de Granada, Granada, Spain, 1994.

[71] R. Marabini and J. M. Carazo, "Pattern recognition and classification of images of biological macromolecules using artificial neural networks," Biophys. J., vol. 66, pp. 1904–1814, 1994.

[72] J. J. Fernández and J. M. Carazo, "Analysis of structural variability within two-dimensional biological crystals by a combination of patch averaging techniques and self-organizing maps," Ultramicroscopy, vol. 65, pp. 81–93, 1996.

[73] J. J. Merelo, A. Prieto, F. Morán, R. Marabini, and J. M. Carazo, "Automatic classification of biological particles from electron-microscopy images using conventional and genetic-algorithm optimized learning vector quantization," Neural Processing Lett., vol. 8, pp. 55–65, 1998.

[74] P. A. Castillo, J. González, J. J. Merelo, V. Rivas, G. Romero, and A. Prieto, "SA-Prop: Optimization of multilayer perceptron parameters using simulated annealing," in Lecture Notes in Computer Science. New York: Springer-Verlag, 1999, vol. 1606, pp. 661–670.

[75] T. Kohonen, "The self-organizing map," Proc. IEEE, vol. 78, pp. 1464–1480, Sept. 1990.

[76] V. M. Rivas, P. A. Castillo, and J. J. Merelo, "Evolving RBF neural networks," in Lecture Notes in Computer Science, J. Mira and A. Prieto, Eds. New York: Springer-Verlag, 2001, vol. I, LNCS 2084, pp. 506–513.

[77] T. Bäck, "The interaction of mutation rate, selection, and self-adaption within a genetic algorithm," in Proc. 2nd Int. Conf. Parallel Problem Solving From Nature, R. Männer and B. Manderick, Eds., 1992, pp. 85–94.

[78] T. Bäck, "Self-adaption in genetic algorithms," in Proc. 1st European Conf. Artificial Life: Toward a Practice of Autonomous Systems, F. J. Varela and P. Bourgine, Eds., 1992, pp. 263–271.

[79] L. Davis, "Adapting operator probabilities in genetic algorithms," in Proc. 1st Int. Conf. Genetic Algorithms Their Applicat., J. J. Grefenstette, Ed., 1989, pp. 61–69.

[80] J. Hesser and R. Männer, "Toward an optimal mutation probability for genetic algorithms," in Proc. 1st Conf. Parallel Problem Solving From Nature, vol. 496, Lecture Notes in Computer Science, H. P. Schwefel and R. Männer, Eds., New York, 1991, pp. 23–32.

[81] I. Rechenberg, Evolutionsstrategie: Optimierung Technischer Systeme Nach Prinzipien der Biologischen Evolution. Stuttgart, Germany: Frommann-Holzboog, 1973.

[82] T. Soule, J. A. Foster, et al., "Code size and depth flows in genetic programming," in Proc. 2nd Annu. Conf. Genetic Programming, K. Koza et al., Eds., 1997, pp. 313–320.

[83] G. Syswerda, "Schedule optimization using genetic algorithms," in Handbook of Genetic Algorithms, L. Davis, Ed. New York: Van Nostrand Reinhold, 1991, pp. 332–349.


Pedro A. Castillo-Valdivieso received the B.Sc. and Ph.D. degrees, both in computer science, from the University of Granada, Granada, Spain, in 1997.

He was a Visiting Researcher at Napier University, Edinburgh, U.K., in July 1998, and at the Santa Fe Institute, Santa Fe, NM, in September 2000. His main interests include the optimization of artificial neural networks using evolutionary algorithms. He was a Teaching Assistant at the Computer Science Department of the University of Jaén, Jaén, Spain. He is currently a Teaching Assistant at the Department of Architecture and Computer Technology of the University of Granada.

Juan J. Merelo received the B.Sc. degree in theoretical physics and the Ph.D. degree, both from the University of Granada, Granada, Spain, in 1988 and 1994, respectively.

He has been a Visiting Researcher at the Santa Fe Institute, Santa Fe, NM, the Politecnico di Torino, Turin, Italy, RISC-Linz, Linz, Austria, the University of Southern California, Los Angeles, and Université Paris-V, Paris, France. He is currently an Associate Professor at the Computer Architecture and Technology Department, University of Granada. His main interests include neural networks, genetic algorithms, and artificial life.

Alberto Prieto (M'74) received the B.Sc. degree in electronic physics in 1968 from the Complutense University, Madrid, Spain, and the Ph.D. degree from the University of Granada, Granada, Spain, in 1976, obtaining the Award of Ph.D. Dissertations and the Citema Foundation National Award.

From 1969 to 1970, he was at the Centro de Investigaciones Técnicas de Guipuzcoa and at the E.T.S.I. Industriales, San Sebastián, Spain. From 1971 to 1984, he was Director of the Computer Centre, and from 1985 to 1990, he was Dean of the Computer Science and Technology studies at the University of Granada. He is currently Full Professor and Director of the Department of Computer Architecture and Technology at the University of Granada. He was a Visiting Researcher at the University of Rennes in 1975, the Open University, U.K., in 1987, the Institut National Polytechnique de Grenoble, Grenoble, France, in 1991, and University College London, London, U.K., from 1991 to 1992. His research interests include intelligent systems.

Dr. Prieto was the Organization Chairman of the International Workshop on Artificial Neural Networks (IWANN'91) and of the 7th International Conference on Microelectronics for Neural, Fuzzy and Bio-inspired Systems (MicroNeuro'99). He was nominated a Member of the IFIP WG 10.6 (Neural Computer Systems) and Chairman of the Spanish RIG of the IEEE Neural Networks Council.

Ignacio Rojas received the M.S. degree in physics and electronics from the University of Granada, Granada, Spain, in 1992, and the Ph.D. degree in neuro-fuzzy networks in 1996.

During 1998, he was a Visiting Professor at the BISC, University of California, Berkeley. He is a Teaching Assistant with the Department of Architecture and Computer Technology at the University of Granada. His main research interests are in the fields of hybrid systems and the combination of fuzzy logic, genetic algorithms, and neural networks, financial forecasting, and financial analysis.

Gustavo Romero received the B.Sc. degree in computer science from the University of Granada, Granada, Spain, in 1997. He is currently pursuing the Ph.D. degree at the University of Granada, with a thesis on evolutionary computation visualization.

He has been working as a Programmer for the CICYT project BIO96-0895. He was a Visiting Researcher at Napier University in January 2001. He is currently working under a fellowship from the regional government at the Computer Architecture and Technology Department of the University of Granada. His main research interests are in neural networks and evolutionary computation.