A Parallel Evolutionary Algorithm for Flexible Neural Tree


Parallel Computing 37 (2011) 653-666


A parallel evolving algorithm for flexible neural tree

Lizhi Peng, Bo Yang, Lei Zhang, Yuehui Chen
Shandong Provincial Key Laboratory of Network Based Intelligent Computing, University of Jinan, Jinan 250022, PR China

Abstract

In the past few decades, much success has been achieved in the use of artificial neural networks for classification, recognition, approximation and control. Flexible neural tree (FNT) is a special kind of artificial neural network whose most distinctive feature is its flexible tree structure, which makes it possible for FNT to obtain near-optimal network structures using tree structure optimization algorithms. However, the modeling efficiency of FNT is always a problem because of its two-stage optimization. This paper presents a parallel evolving algorithm for FNT (PE-FNT). The algorithm uses the PIPE algorithm to optimize tree structures and the PSO algorithm to optimize parameters, and the evaluation processes of both the tree structure populations and the parameter populations are parallelized. As an implementation of the PE-FNT algorithm, two parallel programs were developed using MPI. A small data set, two medium data sets and three large data sets were applied for the performance evaluation of these programs. Experimental results show that PE-FNT is an effective parallel FNT algorithm, especially for large data sets. © 2011 Elsevier B.V. All rights reserved.

Article history: Received 16 February 2009; Received in revised form 11 June 2011; Accepted 19 June 2011; Available online 29 June 2011.
Keywords: Flexible neural tree; Parallel algorithm; Performance evaluation; Artificial neural network.

1. Introduction

Artificial neural network (ANN) is an important artificial intelligence technique. In the past few decades, much success has been achieved in the use of artificial neural networks for classification [15,32], recognition [20,12], and control. Generalization ability is an important property of artificial neural networks because their practical applicability rests on it, and the generalization ability of an ANN is highly dependent on its structure. In traditional artificial neural network techniques, network structures are designed in static and manual ways, so the ultimate performance of a neural network depends to some degree on the experience of the designer. Many research works have attempted to design constructive algorithms which provide an automatic way of finding optimal network structures [17,11,30]. Most of these attempts use dynamic designing processes or integrated methods while training a neural network.

Flexible neural tree (FNT) [5,6,8] is a special kind of artificial neural network proposed in recent years. The most distinctive feature of FNT is that it is designed with flexible tree structures, which makes it possible for FNT to design system structures automatically using algorithms like IP [24], PIPE [28], GP [1] and so on. The FNT model can therefore obtain high generalization ability in many application problems [7,26,27,33]. Based on a pre-defined instruction/operator set, a FNT model can be created and evolved. FNT allows input variable selection, over-layer connections and different activation functions for different nodes. In our previous work, the tree structure was evolved using genetic programming (GP) [7], the probabilistic incremental program evolution algorithm (PIPE) [6], extended compact genetic programming (ECGP), and immune programming (IP) [8], with specific instructions. The fine tuning of the parameters encoded in the structure could be accomplished using simulated annealing (SA) [6], particle swarm optimization (PSO) [9], or a memetic algorithm (MA) [7].


There are two stages in the optimization of a FNT model. The tree structure is first optimized using one of the algorithms described above: at the very beginning of the optimization process a tree structure population is generated, then each structure in the population is evaluated and the best one is selected. Once the best structure has been selected, the next stage is to optimize the parameters of this structure. Then a new two-stage optimization is executed; this iterative process is called the evolving process of the FNT model. Each time a new structure has been found, the parameters of this new structure have to be optimized, which means we have to optimize both tree structures and parameters. The computational cost of this two-stage optimization is obviously higher than that of a simple parameter optimization of a traditional ANN with a fixed network structure, especially for large data sets. So the modeling efficiency of FNT is always a problem.

Parallel computing is an effective way to solve large-scale computing problems, and designing parallel evolutionary algorithms is an important research area [16,22,10]. In recent years, the study of parallel algorithms for problems with hierarchical structures has also been a hot topic [13,14], and many efforts have been made to parallelize traditional neural networks: [23] implemented a parallel feed-forward neural network by dividing up training sets among the processors; Phua and Ming [25] used self-scaling quasi-Newton (SSQN) methods for parallel training of neural networks; and some studies focused on the parallel implementation of neural networks on particular hardware platforms [4,29].

This article designs a parallel evolving algorithm for FNT (PE-FNT). The algorithm uses the PIPE algorithm to optimize tree structures and the PSO algorithm to optimize parameters, and the evaluation processes of both the tree structure populations and the parameter populations are parallelized. As an implementation of this algorithm, two parallel programs were developed using MPI: one using the phase parallel model and the other using the working pool parallel model. A small data set, two medium data sets and three large data sets were applied for the performance evaluation of these parallel PE-FNT programs.

The rest of the paper is organized as follows. The flexible neural tree model and its design method are given in Section 2. Section 3 gives the details of the proposed parallel algorithm; as the standard performance evaluation method is not suitable for our research, a modified performance evaluation method is also given in Section 3. Section 4 presents performance evaluation results using six well-known data sets. Finally, some concluding remarks are presented in Section 5.

2. Basic FNT model

As introduced in Section 1, FNT is a special kind of artificial neural network with flexible tree structures. It is relatively easy for a FNT model to obtain a near-optimal structure using tree structure optimization algorithms. The leaf nodes of a FNT are input nodes and the non-leaf nodes are neurons; the output of the root node is the output of the whole system. The FNT model uses two types of instruction sets for constructing nodes in tree structures: the function set F and the terminal instruction set T. F is used to construct non-leaf nodes and T is used to construct leaf nodes. They are described as

S = F ∪ T = {+2, +3, . . . , +m} ∪ {x1, x2, . . . , xn},

where +i (i = 2, 3, . . . , m) denote non-leaf node instructions taking i arguments, and x1, x2, . . . , xn are leaf node instructions taking no arguments. The output of a non-leaf node is calculated as a flexible neuron model (see Fig. 1); from this point of view, the instruction +i is also called a flexible neuron operator with i inputs. In the creation process of a neural tree, if a non-terminal instruction +i (i = 2, 3, . . . , m) is selected, i real values w_ij (j = 1, 2, . . . , i) are randomly generated and used to represent the connection strengths between the node +i and its children. In addition, two adjustable parameters a_i and b_i are randomly created as flexible activation function parameters. For developing the FNT, the following flexible activation function is used:

f(a_i, b_i, x) = \exp\!\left(-\left(\frac{x - a_i}{b_i}\right)^{2}\right).   (1)

Fig. 1. A flexible neuron operator (left), and a typical representation of the FNT with function instruction set F = {+2, +3, +4, +5, +6} and terminal instruction set T = {x1, x2, x3} (right).


The total excitation function of +i is

\mathrm{net}_i = \sum_{j=1}^{i} w_j x_j,   (2)

where xj (j = 1, 2, . . . , i) are the inputs to node +i. The output of the node +i is then calculated by,

\mathrm{out}_i = f(a_i, b_i, \mathrm{net}_i) = \exp\!\left(-\left(\frac{\mathrm{net}_i - a_i}{b_i}\right)^{2}\right).   (3)
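To make Eqs. (1)-(3) and the recursive, depth-first evaluation described in the next paragraph concrete, here is a minimal Python sketch. It is our own illustration, not the authors' code; the class names, weights and activation parameters are invented.

```python
import math

class LeafNode:
    """Leaf node x_k: simply returns the k-th input variable."""
    def __init__(self, index):
        self.index = index
    def evaluate(self, x):
        return x[self.index]

class NeuronNode:
    """Flexible neuron operator +_i with weights w_j and parameters a, b (Eqs. (1)-(3))."""
    def __init__(self, children, weights, a, b):
        self.children, self.weights, self.a, self.b = children, weights, a, b
    def evaluate(self, x):
        # Eq. (2): net_i = sum_j w_j * x_j, where x_j are the children's outputs
        net = sum(w * c.evaluate(x) for w, c in zip(self.weights, self.children))
        # Eq. (3): out_i = exp(-((net_i - a_i) / b_i)^2)
        return math.exp(-((net - self.a) / self.b) ** 2)

# A tiny FNT: +2(x1, +2(x2, x3)), with made-up weights and parameters
tree = NeuronNode(
    children=[LeafNode(0),
              NeuronNode([LeafNode(1), LeafNode(2)], [0.4, 0.6], a=0.2, b=1.0)],
    weights=[0.7, 0.3], a=0.5, b=1.0)
print(tree.evaluate([0.1, 0.9, 0.4]))  # output of the root = output of the whole FNT
```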

The overall output of the flexible neural tree can be computed recursively from left to right by a depth-first method.

2.1. Tree structure optimization using PIPE

PIPE [28] combines probability vector coding of program instructions, population-based incremental learning [3], and tree-coded programs. PIPE iteratively generates successive populations of functional programs according to an adaptive probability distribution over all possible programs, represented as a probabilistic prototype tree (PPT). Each iteration uses the best program to refine the distribution; thus, the structures of promising individuals are learned and encoded in the PPT.

The PPT is central to the PIPE algorithm. Each program in a population is generated according to the PPT, so the PPT controls the population generation process, and it is also adjusted during the iterative process. The PPT stores the knowledge gained from experience with programs (trees) and guides the evolutionary search. It holds the probability distribution over all possible programs that can be constructed from a predefined instruction set. The PPT is generally a complete n-ary tree with infinitely many nodes, where n is the maximal number of function arguments. Each node N_j in the PPT, with j ≥ 0, contains a variable probability vector P_j. Each P_j has n + m - 1 components, where n is the number of instructions in set T, m - 1 is the number of instructions in set F (since the first instruction in F is +2), and n + m - 1 is the number of instructions in the instruction set S. Each component P_j(I) of P_j denotes the probability of choosing instruction I ∈ S at node N_j. Each vector P_j is initialized as follows:

P_j(I) = \frac{P_T}{n} \quad \forall I : I \in T,   (4)

P_j(I) = \frac{1 - P_T}{m - 1} \quad \forall I : I \in F,   (5)

where P_T is the total probability of selecting terminal instructions.

PIPE combines two forms of learning: generation-based learning (GBL) and elitist learning (EL). GBL is a learning strategy based on the best program of the current population, while EL is a learning strategy based on the global best program, also called the elitist program. GBL is PIPE's main learning algorithm; the purpose of EL is to use the best program found so far as an attractor. The whole PIPE learning frame is as follows:

1: repeat
2:   with probability P_el do EL
3:   otherwise do GBL
4: until termination criterion is reached

Here P_el is a user-defined constant in [0, 1].

2.1.1. Generation-based learning

The main steps of generation-based learning can be described as follows:

(1) Creation of program population. A population of programs PROG_j (0 < j ≤ PS; PS is the population size) is generated using the prototype tree PPT.
(2) Population evaluation. Each program PROG_j of the current population is evaluated on the given task and assigned a fitness value FIT(PROG_j) according to the fitness function. The best program of the current population (the one with the smallest fitness value) is denoted PROG_b. The best program found so far (the elitist) is preserved in PROG_el.
(3) Learning from population. Prototype tree probabilities are modified such that the probability P(PROG_b) of creating PROG_b increases. This procedure is called adapting the PPT towards PROG_b, and it is implemented as follows. First P(PROG_b) is computed by looking at all PPT nodes N_j used to generate PROG_b:

P(\mathrm{PROG}_b) = \prod_{j:\, N_j\ \text{used to generate}\ \mathrm{PROG}_b} P_j\!\left(I_j(\mathrm{PROG}_b)\right),   (6)

where I_j(PROG_b) denotes the instruction of program PROG_b at node position j. Then a target probability P_TARGET for PROG_b is calculated:

P_{\mathrm{TARGET}} = P(\mathrm{PROG}_b) + \left(1 - P(\mathrm{PROG}_b)\right) \cdot lr \cdot \frac{\varepsilon + \mathrm{FIT}(\mathrm{PROG}_{el})}{\varepsilon + \mathrm{FIT}(\mathrm{PROG}_b)}.   (7)


Here lr is a constant learning rate and ε is a positive user-defined constant. Given P_TARGET, all single node probabilities P_j(I_j(PROG_b)) are increased iteratively:

do

P_j\!\left(I_j(\mathrm{PROG}_b)\right) := P_j\!\left(I_j(\mathrm{PROG}_b)\right) + c \cdot lr \cdot \left(1 - P_j\!\left(I_j(\mathrm{PROG}_b)\right)\right)   (8)

until P_j(I_j(PROG_b)) ≥ P_TARGET, where c is a constant influencing the number of iterations. The smaller c is, the higher the approximation precision of P_TARGET and the larger the number of required iterations; setting c = 0.1 turned out to be a good compromise between precision and speed. All adapted vectors P_j are then renormalized.
(4) Mutation of prototype tree. All probabilities P_j(I) stored in nodes N_j that were accessed to generate program PROG_b are mutated with probability P_Mp:

P_j(I) := P_j(I) + mr \cdot \left(1 - P_j(I)\right),   (9)

where mr is the mutation rate, another user-defined parameter. All mutated vectors P_j are also renormalized.
(5) Prototype tree pruning. At the end of each generation the prototype tree is pruned: PPT subtrees attached to nodes that contain at least one probability vector component above a threshold T_P can be pruned.
(6) Termination criteria. The above procedure is repeated until a fixed number of program evaluations is reached or a satisfactory solution is found. In our study, we use two rules to terminate the iteration: one is the maximum iteration number, the other is a critical fitness value. If either the iteration number has reached the maximum value or the fitness of the global best program has reached the critical value (meaning a satisfactory solution has been found), the iterative process is terminated. The critical value is an empirical value, set to 0.001 in this paper.

2.1.2. Elitist learning

Elitist learning focuses the search on previously discovered promising parts of the search space. The basic flow of EL is the same as that of GBL, but whereas in GBL the PPT is adapted towards the best program of the current population, in EL the PPT is adapted towards the elitist program. So we also use Eqs. (6) and (7) to adapt the PPT, but PROG_b is replaced with PROG_el in these equations. EL is particularly useful with small population sizes and works efficiently in the case of noise-free problems.

In order to learn the structure and parameters of a FNT simultaneously, there is a trade-off between structure optimization and parameter learning. In fact, if the structure of the evolved model is not appropriate, it is not useful to pay much attention to parameter optimization; on the contrary, if the best structure has already been found, further structure optimization may destroy it. In this paper, a technique for balancing structure optimization and parameter learning is used: if a better structure is found, then local search (simulated annealing) is performed for a number of steps (the maximum allowed steps), or stopped in case no better parameter vector is found for a significantly long time (say 100-2000 steps in our experiments). The criterion for a better structure is as follows: if the fitness value of the best program is smaller than the fitness value of the elitist program, or the fitness values of the two programs are equal but the former has fewer nodes than the latter, then we say that a better structure has been found.

2.2. Parameter optimization using PSO

Particle swarm optimization (PSO) [18,31] conducts searches using a population of particles that correspond to individuals in an evolutionary algorithm. A population of particles is randomly generated initially. Each particle represents a potential solution and has a position represented by a position vector x_i. A swarm of particles moves through the problem space, with the moving velocity of each particle represented by a velocity vector v_i. At each time step, a function f_i representing a quality measure is calculated using x_i as input. Each particle keeps track of its own best position, which is associated with the best fitness it has achieved so far, in a vector p_i. Furthermore, the best position obtained so far among all the particles in the population is kept track of as p_g; this is the global best solution obtained so far.
In addition to this global version, another version of PSO keeps track of the best position among all the topological neighbors of a particle (for more information on neighborhood topology we refer to [19,21]). At each time step t, using the individual best position p_i(t) and the global best position p_g(t), a new velocity for particle i is updated by

v_i(t+1) = v_i(t) + c_1 \phi_1 \left(p_i(t) - x_i(t)\right) + c_2 \phi_2 \left(p_g(t) - x_i(t)\right),   (10)

where c_1 and c_2 are positive constants and φ_1 and φ_2 are uniformly distributed random numbers in [0, 1]. The term v_i is limited to the range [-v_max, v_max]; if the velocity violates this limit, it is set to the corresponding limit. Changing the velocity in this way enables particle i to search around its individual best position p_i and the global best position p_g. Based on the updated velocities, each particle changes its position according to the following equation:

x_i(t+1) = x_i(t) + v_i(t+1).   (11)
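The following short Python sketch (illustrative only; the function name and parameter values are our own assumptions, not the authors' settings) applies the update rules of Eqs. (10) and (11) to a single particle:

```python
import random

def pso_step(x, v, p_best, g_best, c1=2.0, c2=2.0, v_max=1.0):
    """One PSO update of a particle's velocity and position (Eqs. (10) and (11))."""
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, p_best, g_best):
        phi1, phi2 = random.random(), random.random()              # uniform in [0, 1]
        vi = vi + c1 * phi1 * (pi - xi) + c2 * phi2 * (gi - xi)    # Eq. (10)
        vi = max(-v_max, min(v_max, vi))                           # clamp velocity to +/- v_max
        new_v.append(vi)
        new_x.append(xi + vi)                                      # Eq. (11)
    return new_x, new_v

# Example: one particle in a 3-dimensional parameter space
x, v = [0.1, 0.2, 0.3], [0.0, 0.0, 0.0]
x, v = pso_step(x, v, p_best=[0.5, 0.1, 0.4], g_best=[0.6, 0.0, 0.2])
```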

Based on Eqs. (10) and (11), the population of particles tends to cluster together, with each particle moving in a random direction. This may result in premature convergence on many problems. An effective method to avoid premature convergence is to update the velocity by the following formula [19]:


v_i(t+1) = \chi \left(\omega\, v_i(t) + c_1 \phi_1 \left(p_i(t) - x_i(t)\right) + c_2 \phi_2 \left(p_g(t) - x_i(t)\right)\right),   (12)

where the two new parameters χ and ω are also real numbers. The parameter χ controls the magnitude of the new velocity, whereas the inertia weight ω weights the magnitude of the old velocity v_i(t) in the calculation of the new velocity v_i(t+1).

3. Parallel evolving algorithm for FNT

There are two stages in the basic evolving process of FNT, and both can be parallelized. The first is the evaluation process of the tree structure population: each time the tree structure optimization algorithm generates a new population, the tree structures in the population must be evaluated one by one, and this task can be parallelized. In this paper, the evaluation of a single tree structure is defined as a basic computation unit. The second is the evaluation of the parameter population: when a tree structure has been selected, the parameters of this structure are optimized using the PSO algorithm, and each time a parameter population is generated, the individuals of the parameter population must also be evaluated one by one; this task can also be parallelized. The evaluation of a single parameter individual is likewise defined as a basic computation unit. For a large training data set, most of the computational time is spent on these two evaluation processes. To improve the efficiency of the FNT model, this paper designs a parallel evolving algorithm for FNT (PE-FNT).

3.1. Algorithm design

3.1.1. Basic parallel model and main algorithm

In the FNT model, it is not necessary to parallelize most PIPE processes, such as learning and mutation, because their computational costs are relatively low. If all of these processes were parallelized, the parallel efficiency would be dramatically reduced owing to the increased communication cost.


Fig. 2. The basic parallel model and the main flowchart of the PE-FNT algorithm.


So these tasks are suitable for centralized handling. Similarly, the learning and particle-updating processes of the PSO algorithm are also not suitable for parallel computing. Therefore the master-slave model is the basic model applied in our study: the master process executes the centralized tasks, and the slave processes handle the computational tasks of tree structure population evaluation and parameter population evaluation. The main algorithm of the master process is designed as follows:

Main algorithm of master process
1: do
2:   Generate a new tree structure population
3:   Generate parallel tree structure evaluation tasks and dispatch these tasks to slave processes
4:   Receive tree structure evaluation results from slave processes
5:   Select the best tree structure in the population
6:   do
7:     Generate a new parameter population
8:     Generate parallel parameter evaluation tasks and dispatch these tasks to slave processes
9:     Receive parameter evaluation results from slave processes
10:    Parameter particle learning and updating according to PSO
11:  until the desired fitness is achieved
12:  Tree structure learning and mutation according to PIPE
13: until the desired tree structure is obtained

In this algorithm, the desired fitness in line 11 is a predefined value: if the fitness of any parameter individual reaches this value, the parameter optimization process is terminated. The desired tree structure in line 13 likewise means a structure whose fitness satisfies a predefined value. The slave processes only receive computational tasks and send back the results as soon as they accomplish their tasks, so their main algorithm is as follows:

Main algorithm of slave process
1: do
2:   Receive a message from the master process
3:   if the message is a tree structure evaluation task
4:     Evaluate the tree structure
5:     Send back the evaluation result
6:   else if the message is a parameter evaluation task
7:     Evaluate the parameter
8:     Send back the evaluation result
9:   else
10:    break
11:  end if
12: until an end message is received

Fig. 2 illustrates the main algorithm and the basic master-slave model of the PE-FNT algorithm.

3.1.2. Parallelization of tree structure/parameter population evaluation

In the PE-FNT algorithm, each tree structure population is generated by the master process according to the PIPE algorithm, and when a new tree structure population has been generated, it must be evaluated. Similarly, each parameter population is generated in the master process according to the PSO algorithm, and each individual in the population must be evaluated. Suppose there are s parallel slave processes and the size of a tree structure or parameter population P is m, where s ≤ m. Then the following parallel models can be employed to parallelize the evaluation process.

(1) Phase Parallel Model (PPM). The population P is split into s sub-populations {P1, P2, . . . , Ps}, where Pi (i = 1, 2, . . . , s) includes mi (0 ≤ mi ≤ m) tree structure/parameter individuals, and

\sum_{i=1}^{s} m_i = m.   (13)

The evaluations of these s sub-populations form s parallel tasks {Task1, Task2, . . . , Tasks}, which are assigned to the s parallel processes. When a parallel process has finished its computation, it sends the evaluation results to the master process. The master process waits for results until all tasks have been finished.


The evaluation algorithm of the master process for the phase parallel model
1: for i = 1 to s
2:   Send task i to slave process i
3: end for
4: for finish_count = 1 to s
5:   Wait to receive a result from any slave process
6:   Update the result
7: end for

The evaluation algorithm of the slave process for the phase parallel model
1: Wait to receive an evaluation task message from the master process
2: Evaluate the tree structures/parameters
3: Send back the evaluation results to the master process
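As an illustration of how this phase parallel evaluation could be coded with MPI, here is a minimal mpi4py sketch. It is our own example, not the authors' program; the evaluate function, the message tags and the toy population are assumptions.

```python
# Run with, e.g.: mpiexec -n 5 python ppm_sketch.py
from mpi4py import MPI

TASK_TAG, RESULT_TAG = 1, 2

def evaluate(individuals):
    # Placeholder fitness evaluation; in PE-FNT this would evaluate tree
    # structures or parameter vectors on the training set.
    return [sum(ind) for ind in individuals]

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
s = size - 1                                           # number of slave processes

if rank == 0:                                          # master process
    population = [[0.1 * i, 0.2 * i] for i in range(20)]   # toy population
    chunks = [population[i::s] for i in range(s)]      # split P into s sub-populations
    for i, chunk in enumerate(chunks):
        comm.send(chunk, dest=i + 1, tag=TASK_TAG)     # send task i to slave process i
    results = {}
    for _ in range(s):                                 # wait until all tasks are finished
        status = MPI.Status()
        fitness = comm.recv(source=MPI.ANY_SOURCE, tag=RESULT_TAG, status=status)
        results[status.Get_source()] = fitness         # update the result
    print(results)
else:                                                  # slave process
    chunk = comm.recv(source=0, tag=TASK_TAG)          # receive the evaluation task
    comm.send(evaluate(chunk), dest=0, tag=RESULT_TAG) # send back the evaluation results
```

The working pool model described below differs only in that the master hands out one individual at a time in response to slave requests instead of sending a fixed chunk to each slave up front.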

Suppose the serial execution time to complete Task_i is T_i, where i = 1, 2, . . . , s. Here the serial execution time is the time used by a serial program to complete the task. Then the execution time for the parallel program to complete the whole tree structure/parameter population evaluation is

T = \max\{T_1, T_2, \ldots, T_s\}.   (14)

(2) Working Pool Parallel Model (WPPM).

In this model, the evaluation of each tree structure/parameter individual is a single task. The evaluations of all individuals of the population make up a working pool of size m. A slave process sends a request to the master process; the master process then takes a single task from the working pool and dispatches it to that slave process. The slave process sends back the result as soon as it finishes the task, and then requests a new task, until the working pool is empty.

The evaluation algorithm of the master process for the working pool parallel model
1: finish_count = 0
2: for i = 1 to m
3:   Wait to receive a request or result message from any slave process
4:   if the message is a request for a task
5:     Get the next task from the working pool
6:     Send the task to the slave process which sent the request
7:   else
8:     Update the result
9:     ++finish_count
10:  end if
11: end for
12: while finish_count < m
13:   Wait to receive a result from any slave process
14:   Update the result
15: end while
16: for i = 1 to s
17:   Send a no-task message to slave process i
18: end for

The evaluation algorithm of the slave process for the working pool parallel model
1: do
2:   Send a request message to the master process
3:   Wait to receive a message from the master process
4:   if the message is an evaluation task
5:     Evaluate the tree structure/parameter
6:     Send back the evaluation result to the master process
7:   else if the message is a no-task message
8:     break
9:   else


10:    continue
11:  end if
12: until a no-task message is received

In the working pool parallel model, the whole evaluation task is also logically split into s parallel tasks {Task'1, Task'2, . . . , Task's}. In the phase parallel model, parallel tasks are generated according to the number of evaluation tasks: the master process dispatches evaluation tasks to the parallel tasks uniformly in number. But the execution times of different evaluation tasks can differ greatly, so the times at which the parallel processes finish their tasks may also differ greatly. For example, the time to evaluate a tree with 1000 nodes is far greater than the time to evaluate a tree with only 10 nodes; if one parallel task contains 10 trees with 1000 nodes and another contains 10 trees with 10 nodes, then once the second task has been finished, the corresponding process wastes much time waiting. In the working pool parallel model, by contrast, evaluation tasks are dispatched according to actual requests, which effectively reduces the waiting time of the parallel processes, so the partition of the task is optimized. Suppose the execution time to complete Task'_i is T'_i, where i = 1, 2, . . . , s, the mean value of T_1, T_2, . . . , T_s is T_avg, and that of T'_1, T'_2, . . . , T'_s is T'_avg. Then their variances are

D = \sqrt{\frac{\sum_{i=1}^{s} (T_i - T_{\mathrm{avg}})^2}{s}}, \qquad D' = \sqrt{\frac{\sum_{i=1}^{s} (T'_i - T'_{\mathrm{avg}})^2}{s}}.   (15)
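As a purely hypothetical numerical illustration of this point (the times below are invented, not measurements from the paper), compare an unbalanced static partition with a nearly balanced work-pool dispatch of the same total work:

```python
import statistics

# Hypothetical finish times (seconds) of four parallel tasks for the same total work
ppm_times  = [100, 10, 10, 10]   # static split: one process received all the large trees
wppm_times = [34, 32, 32, 32]    # request-driven dispatch keeps the load nearly balanced

for name, times in (("PPM", ppm_times), ("WPPM", wppm_times)):
    d = statistics.pstdev(times)   # D of Eq. (15): population standard deviation of task times
    wall = max(times)              # the slowest process determines the wall time
    print(f"{name}: D = {d:.1f}, wall time = {wall}")
# The partition with the larger D also has the longer wall time, i.e. more waiting,
# which is the relation D >= D' stated below.
```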

The greater the execution time variance of a task partition solution is, the more time is spent for waiting. According to the analysis above, we have the inequality

D \geq D'.   (16)

3.2. Performance evaluation of the parallel program

As we know, the speedup ratio S_P = T_S / T_P and the efficiency E_P = S_P / P are two basic measures used to evaluate the performance of a parallel program, where P is the number of parallel processes (P = s + 1 in this work: 1 master process and s slave processes), T_S is the execution time of the serial program, and T_P is the wall time of the parallelized version of this program on P homogeneous CPUs. The numerical efficiency NE_P = T_S / C_T and the parallel efficiency PE_P = C_T / (C_T + D_T) are two other important measures for the performance evaluation of a parallel program, where C_T is the total CPU time of the parallel program and D_T is the total idle time. In this paper, S_P, E_P, NE_P and PE_P are used together to evaluate the performance of our parallel programs.

In the FNT model, the execution time may differ between runs of the program, because tree structure populations are generated according to a random strategy and the computational costs of different tree structures are not the same. So it is not correct to use the simple T_S and T_P to calculate the speedup ratio, efficiency and numerical efficiency when the performance of a PE-FNT program is evaluated. For a tree structure population which includes m individuals Tree_1, Tree_2, . . . , Tree_m, suppose Tree_i has n1_i non-leaf nodes and n2_i leaf nodes. According to the basic FNT model, the activation function defined in Eq. (1) is executed n1_i times per training sample during the tree structure evaluation process, while the excitation function defined in Eq. (2) is executed n1_i + n2_i times. So for the whole structure population, the numbers of executions of the activation function and of the excitation function are \sum_{i=1}^{m} n1_i and \sum_{i=1}^{m} (n1_i + n2_i). If the number of iterations of the tree structure optimization is l, then the numbers of executions of the activation function and of the excitation function are

N1_{\mathrm{tree}} = \sum_{j=1}^{l}\sum_{i=1}^{m} n1_{ji}   (17)

and

N2_{\mathrm{tree}} = \sum_{j=1}^{l}\sum_{i=1}^{m} \left(n1_{ji} + n2_{ji}\right),   (18)

where n1_{ji} is the number of non-leaf nodes in the ith tree structure of the jth iteration, n2_{ji} is the number of leaf nodes, N1_tree is the number of executions of the activation function, and N2_tree is that of the excitation function. Similarly, each time a tree structure Tree_{best_i} has been selected, we generate a parameter population of size m'_i. If the selected tree structure has n1'_i non-leaf nodes and n2'_i leaf nodes, then for the whole parameter population the numbers of executions of the activation function and of the excitation function are m'_i · n1'_i and m'_i · (n1'_i + n2'_i). If the number of parameter optimization iterations for Tree_{best_i} is l'_i and l tree structures have been selected in total, then the numbers of executions of the activation function and of the excitation function are

N1_{\mathrm{param}} = \sum_{i=1}^{l} l'_i\, m'_i\, n1'_i   (19)

and

N2_{\mathrm{param}} = \sum_{i=1}^{l} l'_i\, m'_i\, \left(n1'_i + n2'_i\right),   (20)

where N1_param is the number of executions of the activation function and N2_param is that of the excitation function. In this paper, PIPE is used to optimize the tree structures of the FNT model. PIPE is a random optimization algorithm: each time a tree structure population is generated, the structures in this population are not the same as those in other populations. So if we execute two parallel FNT programs on the same data set, the computational costs of the two executions are not the same, which makes it difficult to compute the speedup ratio and efficiency. The computational costs must therefore be taken into account in the parallel performance evaluation of PE-FNT. We use a new quantity, the computational scale (CS), for the performance evaluation of our parallel FNT programs, defined as

CS = N1_{\mathrm{tree}} + N2_{\mathrm{tree}} + N1_{\mathrm{param}} + N2_{\mathrm{param}}.   (21)

Based on the computational scale, a modified speedup ratio S'_P, efficiency E'_P and numerical efficiency NE'_P are used for the performance evaluation, defined as follows:

S'_P = \frac{T_S \cdot CS_P}{T_P \cdot CS_S},   (22)

E'_P = \frac{S'_P}{P},   (23)

NE'_P = \frac{T_S \cdot CS_P}{C_T \cdot CS_S},   (24)

where CS_S is the computational scale of the serial FNT program and CS_P is the computational scale of the parallel FNT program.

4. Experiments

4.1. Data sets

In this paper, six well-known data sets were used for the performance evaluation: a small data set, two medium data sets and three large data sets. All of these data sets come from the UCI Machine Learning Repository [2] except the NIFTY data set. The sizes of the six data sets are listed in Table 1.

(1) NIFTY: a stock index modeling data set, built from stock data for the NIFTY index from January 1998 to December 2001. Each instance has five inputs and one output. The training set includes 400 instances, and the testing set includes 384 instances.
(2) Abalone: the age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, which is not an easy task. Other measurements such as length and weight can be obtained easily and can be used to predict the age. In this data set there are eight measurements as inputs, and the output is the number of rings (i.e. the age of the abalone). The data set contains 4177 instances; we randomly selected 3200 instances for training.
(3) Internet Advertisements (Ad): this data set represents a set of possible advertisements on Internet pages. There are 458 advertisement instances and 2821 non-advertisement instances, so it is a typical binary classification problem. There are 1558 features as inputs and the output is ad or non-ad. The features encode the geometry of the image (if available) as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text. We randomly selected 320 advertisement instances and 1970 non-advertisement instances for training.
(4) CalIt2 Building People Counts (P. Count): the data were collected at the main door of the CalIt2 building at UCI and describe the relationship among time, events and people counts. Observations come from 2 data streams (people flowing in and out of the building) over 15 weeks, with 48 time slices per day (half-hour count aggregates). There are 10080 instances and 4 features; we randomly selected 7000 instances for training.

Table 1. Data sets used for the experiments.

Data set                                   # of instances   Size of the training set   # of features
NIFTY                                      784              400                        6
Abalone                                    4177             3200                       8
Internet Advertisements (Ad)               3279             2290                       1558
CalIt2 Building People Counts (P. Count)   10080            7000                       4
KDD99                                      11982            5092                       41
Adult                                      48842            30000                      14


(5) KDD99: the original data were prepared for the 1998 DARPA intrusion detection evaluation program by the MIT Lincoln Laboratory. The data set has 41 attributes for each connection record plus one class label. Some features are derived features, which are useful in distinguishing normal connections from attacks. The data set has five classes: Normal, DOS, R2L, U2R and Probe, the last four being types of attacks (DOS is the Denial of Service attack, R2L the Remote to Local attack and U2R the User to Root attack). The training and test sets comprise 5092 and 6890 records, respectively. This data set is widely used in intrusion detection research.
(6) Adult: this data set is also called the Census Income data set. The data were extracted by Barry Becker from the 1994 Census database; the task is to predict whether income exceeds $50K/yr based on census data. There are 48842 instances and 14 features; we randomly selected 30000 instances for training.

4.2. Methodology

Two parallel PE-FNT programs were developed in this research using MPI. One is based on the phase parallel model and is called the PPM program; the other is based on the working pool model and is called the WPPM program. The programs were run on 5 homogeneous nodes of the Dawning 4000L high performance computing cluster of the University of Jinan. Each node has 4 homogeneous CPUs and 2 GB of physical memory; the OS is Red Hat Linux 9 and the parallel environment is MPICH2. Since our target is to evaluate the performance of the PE-FNT algorithm, it is not necessary to take the prediction/classification accuracy into account, and only the training process is measured. For each data set, the serial FNT program was first run on the training set, and the execution time and computational scale of the serial program were recorded. Then the PPM program was run 4 times, with 4 parallel processes the first time, 6 the second, 8 the third and 10 the last; the WPPM program was also run 4 times on 4, 6, 8 and 10 parallel processes. For each run of a parallel program, the wall time, CPU time (C_T), idle time (D_T) and computational scale were recorded. Finally, the speedup ratios, efficiencies and numerical efficiencies were calculated using formulas (22)-(24), and the parallel efficiencies were calculated by the standard method.

The population size (the number of individuals contained in a population) is an important parameter for both the PIPE and the PSO algorithm. The optimal solution cannot be searched effectively if the population size is too small, and a too-small population size also easily results in premature convergence. But there are also problems if the population size is too large: first, the randomness of the solution search is strengthened, which makes PIPE and PSO tend towards pure random search; another problem is the computational cost. There is still no theoretical method to determine population sizes for the PIPE and PSO algorithms, and in practice population sizes are determined by the researchers' experience in most applications. In this paper, different tree structure (PIPE) and parameter (PSO) population sizes were used for different data sets, as shown in Table 2.

For comparison, we also designed a parallel typical feed-forward neural network and a parallel RBF neural network by dividing up the training sets, a method similar to the proposal of [23]. Owing to the limitation of the instance numbers, only the three large data sets were used.
For each data set, the training set was divided uniformly into 4, 6, 8 and 10 subsets, and we again used 4, 6, 8 and 10 parallel processes, the same settings as in the PE-FNT experiments. The PSO algorithm was also used for the optimization of the network parameters (weights, and center positions for the RBF-NN). There was one hidden layer in the feed-forward NN, and a Gaussian density function was used in the RBF-NN. The experiment settings for the parallel NN and RBF-NN are shown in Table 3. The serial programs of these two neural networks were run first, and then the parallel versions were run four times on 4, 6, 8 and 10 parallel processes. Speedup ratios, efficiencies, numerical efficiencies and parallel efficiencies were computed for comparison. It should be noted that the tree structures were not the same each time the PE-FNT programs ran; this is the reason the computational scale was introduced. But the structures of these two neural networks are static, which means the computational scales are the same each time the NN programs run. So the comparisons of speedup ratios, efficiencies, numerical efficiencies and parallel efficiencies between PE-FNT and these two parallel NNs are coarse comparisons.

4.3. Results and analysis

Fig. 3 shows the speedup ratios of the PE-FNT programs using the PPM model and the WPPM model; the left part shows the results of the PPM model and the right part shows the results of the WPPM model.

Table 2. Population sizes.

Data set   Tree structure population size   Parameter population size
NIFTY      20                               30
Abalone    30                               30
Ad         30                               40
P. Count   40                               30
KDD99      40                               40
Adult      40                               30


For the NIFTY data set, the speedup ratios stayed at a low level between 1 and 2, which means the wall time was not reduced significantly by the parallel programs for small data sets. But for large data sets like P. Count, KDD99 and Adult, the speedup ratios improved significantly compared with the small and medium data sets, the highest speedup ratio reaching 7.84. It can be seen from the figure that the larger a data set is, the higher the speedup ratio, i.e., the PE-FNT programs are more effective for large data sets than for small ones. It can also be seen that the speedup ratios of the WPPM model improved more significantly than those of the PPM model, especially for large data sets, which is in accordance with the analysis in Section 3.1.

Efficiencies, numerical efficiencies and parallel efficiencies are shown in Figs. 4-6. All three indexes are high and stable for the large data sets: the numerical efficiencies of the P. Count, KDD99 and Adult data sets are all greater than 0.75, while for the NIFTY data set none of the numerical efficiencies exceeds 0.55. So we can infer that for large data sets most of the time of the PE-FNT programs was spent on computing, whereas for small data sets the PE-FNT programs had to spend much time on communication and waiting. For most data sets, the average efficiencies, numerical efficiencies and parallel efficiencies of the WPPM model are higher than those of the PPM model. This again shows that WPPM can divide parallel tasks more efficiently than the PPM model.

Table 3. Experiment settings for the parallel NN and RBF-NN.

Data set   NN: # of nodes in hidden layer   NN: parameter population size   RBF-NN: # of nodes in hidden layer   RBF-NN: parameter population size
P. Count   30                               30                              20                                   30
KDD99      20                               40                              20                                   40
Adult      30                               30                              20                                   30

Fig. 3. Speedup ratios of PE-FNT programs.

Fig. 4. Efficiencies of PE-FNT programs.

Table 4. Speedup ratios of parallel NN, parallel RBF-NN and PE-FNT of WPPM.

Data set / # of processes   Parallel NN   Parallel RBF-NN   PE-FNT of WPPM
P. Count / 4                2.35          2.64              2.77
P. Count / 6                3.73          3.59              4.63
P. Count / 8                4.86          4.69              5.25
P. Count / 10               6.27          5.90              6.74
KDD99 / 4                   2.24          2.19              2.17
KDD99 / 6                   3.31          3.73              3.69
KDD99 / 8                   4.58          4.67              4.79
KDD99 / 10                  5.51          5.60              5.44
Adult / 4                   2.72          2.91              3.17
Adult / 6                   4.79          3.80              5.51
Adult / 8                   6.06          5.19              6.99
Adult / 10                  6.94          6.35              7.48

Table 5. Efficiencies of parallel NN, parallel RBF-NN and PE-FNT of WPPM.

Data set / # of processes   Parallel NN   Parallel RBF-NN   PE-FNT of WPPM
P. Count / 4                0.59          0.66              0.69
P. Count / 6                0.62          0.60              0.77
P. Count / 8                0.61          0.59              0.66
P. Count / 10               0.63          0.59              0.67
KDD99 / 4                   0.56          0.55              0.54
KDD99 / 6                   0.55          0.62              0.62
KDD99 / 8                   0.57          0.58              0.60
KDD99 / 10                  0.55          0.56              0.54
Adult / 4                   0.68          0.73              0.79
Adult / 6                   0.80          0.63              0.92
Adult / 8                   0.76          0.65              0.87
Adult / 10                  0.69          0.64              0.75

Fig. 5. Numerical efficiencies of PE-FNT programs.

Tables 4 and 5 show the comparisons of speedup ratios and efficiencies among the parallel NN, the parallel RBF-NN and PE-FNT with the WPPM model. For the P. Count and Adult data sets, the speedup ratios and efficiencies of PE-FNT are all higher than those of the parallel NN and the parallel RBF-NN. For the KDD99 data set the order is different: the speedup ratios and efficiencies of PE-FNT are slightly lower than those of the parallel RBF-NN. It should be noted that KDD99 is the smallest of these three data sets, so it can again be inferred that PE-FNT is effective for large data sets.


Fig. 6. Parallel efficiencies of PE-FNT programs.

5. Conclusions

In this paper, we designed a parallel algorithm named PE-FNT for the flexible neural tree. As implementations of this algorithm, two parallel programs were developed based on MPI: one using the phase parallel model and the other using the working pool parallel model. A small data set, two medium data sets and three large data sets were used for the performance evaluation of these parallel programs, and the PE-FNT programs were also compared with a parallel NN and a parallel RBF-NN. The experimental results show that PE-FNT is an effective parallel FNT algorithm, especially for large data sets, and that the program based on the working pool parallel model was more effective than the one based on the phase parallel model in most cases.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under contract numbers 60903176, 60873089 and 61070130, the Provincial Natural Science Foundation for Outstanding Young Scholars of Shandong under contract number JQ200820, the Program for New Century Excellent Talents in University under contract number NCET-10-0863, the Key Subject (Laboratory) Research Foundation of Shandong Province XTD0709, the Science and Technology Program of Shandong Provincial Education Department under contract number J08LJ18, and the Scientific Research Foundation for the Excellent Middle-Aged and Youth Scientists of Shandong Province of China under Grant No. BS2009DX037.

References

[1] A. Abraham, Evolutionary computation in intelligent web management, Evolutionary computing in data mining, in: A. Ghosh, L. Jain (Eds.), Studies in Fuzziness and Soft Computing, Springer, 2004, pp. 189-210.
[2] A. Asuncion, D.J. Newman, UCI Machine Learning Repository, 2009.
[3] S. Baluja, Population-based Incremental Learning: A Method for Integrating Genetic Search Based Function Optimization and Competitive Learning, Technical Report CMU-CS-94-163, Carnegie Mellon University, Pittsburgh, 1994.
[4] A.D. Blas, A. Jagota, R. Hughey, Optimizing neural networks on SIMD parallel computers, Parallel Computing 31 (2005) 97-115.
[5] Y. Chen, B. Yang, J. Dong, Nonlinear systems modelling via optimal design of neural trees, International Journal of Neural Systems 14 (2004) 125-138.
[6] Y. Chen, B. Yang, J. Dong, A. Abraham, Time series forecasting using flexible neural tree model, Information Sciences 174 (2005) 219-235.
[7] Y. Chen, B. Yang, A. Abraham, Flexible neural trees ensemble for stock index modeling, Neurocomputing 70 (2007) 697-703.
[8] Y. Chen, F. Chen, J.Y. Yang, Evolving MIMO flexible neural trees for nonlinear system identification, in: IC-AI, 2007, pp. 373-377.
[9] Y. Chen, A. Abraham, B. Yang, Hybrid flexible neural tree based intrusion detection systems, International Journal of Intelligent Systems 22 (4) (2007) 337-352.
[10] J.N. Choi, S.K. Oh, W. Pedrycz, Structural and parametric design of fuzzy inference systems using hierarchical fair competition-based parallel genetic algorithms and information granulation, International Journal of Approximate Reasoning 49 (2008) 631-648.
[11] D. Coyle, G. Prasad, T.M. McGinnity, Faster self-organizing fuzzy neural network training and a hyperparameter analysis for a brain-computer interface, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 39 (6) (2009) 1458-1471.
[12] M.J. Er, W. Chen, S. Wu, High-speed face recognition based on discrete cosine transform and RBF neural networks, IEEE Transactions on Neural Networks 16 (3) (2005) 679-691.
[13] P.A. Estevez, H. Paugam-Moisy, D. Puzenat, M. Ugarte, A scalable parallel algorithm for training a hierarchical mixture of neural experts, Parallel Computing 28 (2002) 861-891.
[14] Z. Feng, B. Zhou, J. Shen, A parallel hierarchical clustering algorithm for PCs cluster system, Neurocomputing 70 (2007) 809-818.
[15] A. Georgieva, I. Jordanov, Intelligent visual recognition and classification of cork tiles with neural networks, IEEE Transactions on Neural Networks 20 (4) (2009) 675-685.
[16] S. Gustafson, E.K. Burke, The speciating island model: an alternative parallel evolutionary algorithm, Journal of Parallel and Distributed Computing 66 (2006) 1025-1036.
[17] M.M. Islam, M.A. Sattar, M.F. Amin, X. Yao, K. Murase, A new constructive algorithm for architectural and functional adaptation of artificial neural networks, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 39 (6) (2009) 1590-1605.
[18] J. Kennedy, R.C. Eberhart, Particle swarm optimization, in: Proceedings of the IEEE International Conference on Neural Networks, 1995, pp. 1942-1948.


[19] J. Kennedy, Small worlds and mega-minds: effects of neighborhood topology on particle swarm performance, in: Proceedings of the 1999 Congress on Evolutionary Computation, vol. 3, 1999, pp. 1931-1938.
[20] E. Kolman, M. Margaliot, Knowledge extraction from neural networks using the all-permutations fuzzy rule base: the LED display recognition problem, IEEE Transactions on Neural Networks 18 (3) (2007) 925-931.
[21] T. Krink, J. Vesterstrom, J. Riget, Particle swarm optimization with spatial particle extension, in: Proceedings of the Congress on Evolutionary Computation, 2002.
[22] D. Lim, Y.S. Ong, Y. Jin, B. Sendhoff, B.S. Lee, Efficient hierarchical parallel genetic algorithms using grid computing, Future Generation Computer Systems 23 (2007) 658-670.
[23] F. Morchen, Analysis of speedup as function of block size and cluster size for parallel feed-forward neural networks on a Beowulf cluster, IEEE Transactions on Neural Networks 15 (2) (2004) 515-527.
[24] M. Petr, L. Adriel, R. Marekt, W.-S. Loren, Immune programming, Information Sciences 176 (2006) 972-1002.
[25] P.K.H. Phua, D. Ming, Parallel nonlinear optimization techniques for training neural networks, IEEE Transactions on Neural Networks 14 (6) (2003) 1460-1467.
[26] S. Qu, Z. Liu, G. Cui, B. Zhang, S. Wang, Modeling of cement decomposing furnace production process based on flexible neural tree, in: Proceedings of the 2008 International Conference on Information Management, Innovation Management and Industrial Engineering, 2008, pp. 128-133.
[27] S. Qu, Z. Liu, G. Cui, Q. Wang, Process parameters selection of fluid industry based on flexible neural trees, Journal of Communication and Computer 5 (9) (2008) 74-79.
[28] R.P. Salustowicz, J. Schmidhuber, Probabilistic incremental program evolution, Evolutionary Computation 5 (2) (1997) 123-141.
[29] U. Seiffert, Artificial neural networks on massively parallel computer hardware, Neurocomputing 57 (2004) 135-150.
[30] A.L.P. Tay, J.M. Zurada, L. Wong, J. Xu, The hierarchical fast learning artificial neural network (HieFLANN): an autonomous platform for hierarchical neural network construction, IEEE Transactions on Neural Networks 18 (6) (2007) 1645-1657.
[31] H. Yoshida, K. Kawata, Y. Fukuyama, S. Takayama, Y. Nakanishi, A particle swarm optimization for reactive power and voltage control considering voltage security assessment, IEEE Transactions on Power Systems 15 (2000) 1232-1239.
[32] G.P. Zhang, Neural networks for classification: a survey, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 30 (4) (2000) 451-462.
[33] J. Zhou, Y. Liu, Y. Chen, ICA based on KPCA and hybrid flexible neural tree for face recognition, in: Proceedings of the 6th International Conference on Computer Information Systems and Industrial Management Applications, 2007, pp. 245-250.