
Expert Systems with Applications 38 (2011) 6884–6894


Efficiency of local models ensembles for time series prediction

David Martínez-Rego, Oscar Fontenla-Romero, Amparo Alonso-Betanzos*
Laboratory for Research and Development in Artificial Intelligence (LIDIA), Department of Computer Science, Faculty of Informatics, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain


Keywords: Time series analysis, Machine learning, Neural networks, Clustering


* Corresponding author. E-mail address: [email protected] (A. Alonso-Betanzos).

Many current technological challenges require the capacity to forecast future measurements of a phenomenon. This, in most cases, leads directly to solving a time series prediction problem. Statistical models are the classical approaches for tackling this problem. More recently, neural approaches such as Backpropagation, Radial Basis Functions and recurrent networks have been proposed as an alternative. Most neural-based predictors have chosen a global modelling approach, which tries to approximate a goal function by adjusting a single model. This design philosophy can present problems when the data are extracted from a phenomenon that continuously changes its operational regime or represents distinct operational regimes in an unbalanced manner. In this paper, two alternative neural-based local modelling approaches are proposed. Both follow the divide and conquer principle, splitting the original prediction problem into several subproblems and adjusting a local model for each one. In order to check their adequacy, these methods are compared with other classical global and local modelling approaches using three benchmark time series and different sizes (medium and large) of training data sets. As shown, both models prove to be useful pragmatic paradigms for improving forecasting accuracy, with the advantages of a relatively low computational time and scalability to data set size.


1. Introduction

Nowadays, many knowledge areas and scientific disciplines such as Physics, Engineering and Economics generate data that can be considered as time series. For this reason, time series analysis is broadly used at present in many knowledge areas and with different purposes: prediction, process control, process simulation, etc. It is also one of the principal tasks of data mining. Time series, from a general point of view, consist of a sequence of measurements corresponding to a phenomenon observed over time and therefore chronologically ordered. In this work we deal specifically with the problem of time series prediction.

Time series prediction is the process of forecasting a future measurement of a phenomenon by analyzing the relationship between a pattern of past values and a current or future value (Box, Jenkins, & Reinsel, 1994). This problem has been broadly studied in statistics over the last decades and several models have been developed. The Autoregressive Moving Average (ARMA) and Autoregressive Integrated Moving Average (ARIMA) models (Box et al., 1994) developed by George Box and Gwilym Jenkins, together with their variants, stand out among statistical approaches. These models have been applied with success to several real problems in the past (Di Giacinto, 2006; Olafsson & Sigbjornsson, 1996).


Since the late 1990s, statisticians and mathematicians have broadly accepted machine learning methods as an approach to time series prediction and other mathematical problems. The reason is that these methods have proven to be powerful non-linear estimators applicable to many real-world problems, outperforming other mathematical models (Van Veelen, Nijhuis, & Spaanenburg, 2000). Two main areas of machine learning have approached the time series prediction problem: artificial neural networks (ANNs) and kernel methods. There are two main trends in the design of machine learning models for time series prediction:

• Monolithic or global modeling approaches: they consist of a single model trained to obtain the minimum global error for all presented patterns. This group contains kernel methods designed for this purpose, such as Support Vector Regression (SVR) (Smola & Schölkopf, 1998) or Least Squares Support Vector Machines (LS-SVM) (Suykens & Vandewalle, 1999; Suykens, Van Gestel, De Brabanter, De Moor, & Vandewalle, 2002), and most of the neural-based predictors. The latter can be classified into two main groups: static networks, which try to solve the time series prediction problem through on-line learning, that is, giving the network an explicit notion of time (the time variable is an input of the network), and dynamic networks, which tackle the problem by supplying the network with a pattern of past values as input or by including feedback connections in the network. Representative models of the dynamic networks approach are Tapped Delay Line Multilayer Perceptron (TDL-MLP) (Haykin, 1999; Van Veelen et al., 2000), Time Delay Neural Networks (TDNN) (Waibel, Hanazawa, Hinton, Shikano, & Lang, 1988), Elman Networks (Elman, 1990) and Jordan Networks (Jordan, 1986). Most of these neural-based models are derived from the classical Multilayer Perceptron, adapting it to manage temporal information. Global approaches can present problems when the data being processed have any of these properties:
  – Data are extracted from a phenomenon that changes its operational regime. If this happens, it will be difficult to generalize all operational regimes with only one model.
  – Different operational regimes are represented in the data in an unbalanced manner. With a monolithic approach, less represented patterns may not be taken into consideration because they are not important for reducing the global error of the model.

• Modular or local modeling approaches: these approaches consist of a set of local models, each trained to obtain the minimum error for a subset of the available data. The final system is composed of a set of local experts, each one trained to solve a subproblem of the original problem, which together cover all the possible cases. In other words, the goal predictive function f is estimated as the union of several local estimators

f(x) = \bigcup_{i=0}^{M} \hat{f}_i(x),    (1)

where the local estimators \hat{f}_i(x) are defined in different regions of the input space. This kind of model was developed for the first time by Nowlan, Hinton, Jacobs, and Jordan (1991) and Jacobs and Jordan (1993) in the early 1990s and, since the end of that decade, has been considered a promising technique in time series prediction. In the taxonomy of multiple classifier systems proposed in Kuncheva (2004), this strategy is called classifier selection and is differentiated from the classifier fusion strategy. In the latter, all the local models have knowledge of the whole feature space and their outputs are joined by a nontrainable combiner (such as a majority vote scheme) or a trainable combiner (such as a Multilayer Perceptron). With a classifier selection philosophy, special cases and different operational regimes are learnt by different local experts, each one reducing the error committed in its region of the input space. This property is especially important in real-life time series prediction problems, where data are often non-stationary and include several special cases. These approaches often have a handicap: high training times due to the need to train a large set of local models.

In this paper, two methods for undertaking a time series prediction problem are presented. The proposed methods are designed to take into account the different regimes of a time series and to improve the global error committed by the network by decreasing the local errors committed by each local model. These methods belong to the modular approaches group. It will be shown that the proposed methods exhibit low training times despite having to train a large set of local expert models. This fact demonstrates that both approaches are pragmatic paradigms for improving accuracy in time series prediction and scale up to long time series. As will be shown, the proposed models not only perform better in accuracy and time than other local and global models, but also obtain a better scalability ratio. That is, all methods generally perform better on large data sets and, although the proposed ones also degrade their performance when the data set size diminishes, they are less affected by this fact than the other methods.

The rest of the paper is organized as follows: Section 2 presents in detail the two proposed methods; Section 3 presents the results of an empirical comparative study of our methods on two artificial datasets, the Lorenz and Henon time series, and one real dataset, the Dow–Jones time series. The results were compared with those obtained by (a) two local modeling approaches and (b) two commonly used classical global modeling approaches. The last section includes a brief discussion of the results obtained in this work and future research lines as the concluding part of this paper.

2. Proposed methods

2.1. General description

The proposed architectures are based on the application of a clustering algorithm combined with one-layer neural networks. They consist mainly of the following stages:

• Stage 1: The time series x(t) is embedded from the original one-dimensional space into an N-dimensional reconstruction space.
• Stage 2: The patterns generated in the previous stage are divided into groups using a clustering method. The goal is to divide the reconstruction space generated by Stage 1 into several clusters and then to assign each cluster or zone to a local model, which will specialize only in the patterns that belong to this region (a minimal sketch of these first two stages is given after this list).
• Stage 3: A local model or expert network is trained for each cluster defined in Stage 2. One-layer neural networks trained with a fast algorithm described by Fontenla-Romero, Alonso-Betanzos, Castillo, and Guijarro-Berdiñas (2002) are used to generate the local models.
• Stage 4: This stage is used only in the second proposed architecture. A fusion strategy is used to obtain a more robust and accurate system. The expert networks are grouped into committees of expert networks. For each committee, a one-layer neural network is trained to learn an adequate weighting of the outputs of its expert networks. The weighted final output of a committee will, in most cases, be more accurate than the one obtained by a single expert network. This kind of fusion approach was proposed for the first time in Wolpert (1992) for classification problems and called stacked generalization. It is a more flexible approach than fixed fusion strategies, like the ones proposed in Kuncheva (2002), finding an appropriate weighting of the outputs of the local models for each zone of the input space.
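As a rough illustration of stages 1 and 2, the sketch below (Python, not the authors' Matlab implementation) embeds a series with a set of delays and then groups the resulting patterns; plain k-means is used only as a stand-in for the SOM, GNG and VQIT methods described later, and the function names and parameter values are illustrative assumptions.

import numpy as np


def embed(series, delays, p):
    # Stage 1: build pairs (d(t), x(t)) with x(t) = [x(t - r1), ..., x(t - rN)]^T
    # and desired output d(t) = x(t + p).
    r_max = max(delays)
    X, d = [], []
    for t in range(r_max, len(series) - p):
        X.append([series[t - r] for r in delays])
        d.append(series[t + p])
    return np.asarray(X), np.asarray(d)


def kmeans(X, n_nodes, n_iter=100, seed=0):
    # Stage 2: a generic clustering stage; returns node positions and the
    # owner (closest node) of every pattern.
    rng = np.random.default_rng(seed)
    nodes = X[rng.choice(len(X), n_nodes, replace=False)]
    for _ in range(n_iter):
        owner = np.argmin(((X[:, None, :] - nodes[None]) ** 2).sum(-1), axis=1)
        for k in range(n_nodes):
            if np.any(owner == k):
                nodes[k] = X[owner == k].mean(axis=0)
    owner = np.argmin(((X[:, None, :] - nodes[None]) ** 2).sum(-1), axis=1)
    return nodes, owner


rng = np.random.default_rng(1)
series = np.sin(0.05 * np.arange(2000)) + 0.1 * rng.standard_normal(2000)
X, d = embed(series, delays=[1, 2, 3, 4, 5], p=1)   # stage 1
nodes, owner = kmeans(X, n_nodes=20)                # stage 2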

The next two subsections explain in detail the two proposed architectures.

2.2. Distributed Local Experts (DLE)

This architecture is composed of the following three blocks:

1. An embedding layer implemented by a Time Delay Line. This layer embeds the original one-dimensional time series, x(t), into the reconstruction space. The output of this layer is a set S of N-dimensional state vectors created from the input signal with the form x(t) = [x(t), x(t − r_1), x(t − r_2), ..., x(t − r_N)]^T, where r_i represents the delays of the series selected to generate the prediction. Fig. 1 shows an example of the transformation from the input space to a reconstruction space. In this case, the original time series is converted into a set of three-dimensional patterns or time windows composed of past values of the series.

[Fig. 1. Time series transformation example.]

2. A clustering method that divides the reconstruction space into different subregions. The input of this layer is the pair (d(t), x(t)), where d(t) represents the desired output of the system. In our time series prediction scenario, d(t) = x(t + p), where p is the desired prediction step. In this work, three specific clustering methods were used (Self-Organizing Maps (SOM) (Kohonen, 2001), Growing Neural Gas (GNG) (Fritzke, 1995) and Vector Quantization using Information Theoretic Concepts (VQIT) (Principe, Lehn-Schioler, Hedge, & Erdogmus, 2005)), although the proposed local methods can be applied using any clustering algorithm. Fig. 2 shows a possible division of the reconstruction space of Fig. 1. Big white dots represent the nodes obtained by the clustering method. Around each node, a cluster is created. Each cluster is formed by the patterns for which that node is the closest according to a distance measure (Euclidean distance in our case). SOM and GNG are applied as originally described. The VQIT model, however, has a problem: when the number of patterns increases, its training time becomes too high. To obtain an acceptable training time while maintaining a good distribution of the nodes across the embedded space, the original model presented in Principe et al. (2005) was modified. The following two changes were made:
   – Instead of training with the whole training set at each iteration of the algorithm, a random subset of D training data from the original set (with D a parameter defined by the user) is taken. The modified algorithm chooses a different random sample of the original set for each iteration.

   – The learning rate α of the system (originally a fixed value) takes variable values. It starts with high values in the first iterations to move all the nodes of the system near the data. At each iteration, this learning rate is decremented by multiplying its previous value by a constant β ∈ (0, 1), in order to obtain a smooth adjustment of the system once the nodes are distributed across the input space.
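The sketch below illustrates only these two modifications (a random subset of size D per iteration and a learning rate α decayed by a factor β), applied to a generic nearest-prototype update; it is a hedged stand-in and does not reproduce the actual information-theoretic update rule of VQIT (Principe et al., 2005).

import numpy as np


def quantize(X, n_nodes, D=256, alpha=0.5, beta=0.95, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    nodes = X[rng.choice(len(X), n_nodes, replace=False)].copy()
    for _ in range(n_iter):
        # modification 1: a different random subset of D patterns per iteration
        batch = X[rng.choice(len(X), min(D, len(X)), replace=False)]
        for x in batch:
            k = np.argmin(((nodes - x) ** 2).sum(axis=1))    # closest node
            nodes[k] += alpha * (x - nodes[k])               # move it towards the pattern
        # modification 2: decreasing learning rate for a smooth final adjustment
        alpha *= beta
    return nodes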

[Fig. 2. Reconstruction space division example.]

3. A set of one-layer neural networks (one for each node of the clustering algorithm) that generate an expert network for each cluster previously created. Two properties of the local models are desirable: a smooth continuity among neighbor models and the existence of enough data to train each expert network properly. In order to achieve this, the influence areas of the expert networks are overlapped. Thus, each expert network trains with the patterns owned by its associated node and by its L closest neighbor nodes. Fig. 3 depicts schematically an example of expert network generation from the clusters created in Fig. 2. In this example, each expert network trains with the patterns owned by its associated node and by its closest neighbor node. The single-layer neural networks of this stage are trained using the powerful and fast training algorithm by Fontenla-Romero et al. (2002). This algorithm always obtains the global optimum in a direct (not iterative) manner by solving a system of I + 1 linear equations and unknowns (where I is the number of inputs of the network). The cost function minimized by this network, used as local model, is

Error = \sum_{s=1}^{S} \left( f'(\bar{d}_s)\,\bar{\varepsilon}_s \right)^2,    (2)

where f' is the derivative of the neural activation function, d the desired output, \bar{d} = f^{-1}(d), \bar{\varepsilon}_s = \bar{d}_s − (w^T x + b), w is the weight vector and b the bias. Thanks to this algorithm, the DLE model overcomes the handicap, common in local modeling, of high training times.
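A minimal sketch of the closed-form fit behind Eq. (2) is given below, assuming a logistic activation and targets already scaled into (0, 1); it is not the authors' implementation, but it shows how the I + 1 unknowns (w, b) can be obtained directly from a weighted linear least-squares problem.

import numpy as np


def fit_one_layer(X, d, eps=1e-6):
    # Targets are mapped through the inverse activation and each sample is
    # weighted by f'(d_bar), as in Eq. (2).
    d = np.clip(d, eps, 1.0 - eps)
    d_bar = np.log(d / (1.0 - d))            # inverse of the logistic activation
    weights = d * (1.0 - d)                  # f'(d_bar) for the logistic function
    A = np.hstack([X, np.ones((len(X), 1))])
    sol, *_ = np.linalg.lstsq(A * weights[:, None], d_bar * weights, rcond=None)
    return sol[:-1], sol[-1]                 # weight vector w and bias b


def predict_one_layer(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))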

Algorithm 1 details the training process of the architecture. The training parameters set h depends on the clustering method selected, because each method receives different input parameters. Once the nodes of C are adjusted, it could be the case that some of them are far from the input space defined by the training patterns or that there are many redundant nodes in a zone. These nodes would make the system work improperly, and thus they are deleted in step 4. In step 5, patterns owned by useless or redundant nodes change their owner to the closest neighbor of the deleted node. Finally, in step 6 the local models are trained. To obtain them, it is necessary to ensure that each model has at least n_R + 1 data to train, where n_R is the number of delays in R. If this condition is not fulfilled, the algorithm selected to train the expert networks (Fontenla-Romero et al., 2002) would not be able to construct an accurate local model. Finally, note that in step 6a, the way of selecting the training sets for the local models produces two effects:

• It ensures that each node has enough data to train a local model with the algorithm selected for this purpose.

• It produces an overlapping between the areas corresponding to each node. This ensures a smooth continuity among neighbor models.

[Fig. 3. Each cluster has an associated local model (expert network) that obtains a partial solution of the whole problem.]

Let M be the number of clusters, |S| the number of patterns in S, |R| the number of delays in R and C_C the complexity of the clustering algorithm selected; the complexity of the DLE algorithm is then O(max(M|S|, C_C)). The first term of the max expression comes from step 3. Thanks to the algorithm selected for the local models, the complexity of step 6b is O(M|R|^2) and, given that in most cases |R| ≪ |S|, this is not the most costly step of the training phase (as it is in other ensemble-based models).

Algorithm 1. Distributed Local Experts training algorithm

Input: Time series x, temporal delays set R, desired prediction step p, elimination boundary g, number of neighbors L, training parameters set of the clustering method h.

1. Generate from the time series x a set S of vectors with the following structure: (d(t), x(t)), where d(t) = x(t + p) and x(t) = [x(t − r_1), x(t − r_2), ..., x(t − r_N)]^T, r_i ∈ R.
2. Create N_C clusters using the clustering method C, the training parameters set h and the training patterns set S.
3. For each vector of S, calculate which is its closest cluster.
4. Delete the nodes of C that own fewer than g training vectors.
5. If any nodes were deleted in step 4, repeat step 3 for the remaining nodes; else, go to the next step.
6. For each non-deleted node c_i of C:
   (a) Construct a training set T_i with the vectors of S that are owned by this node and by its L closest neighbors.
   (b) Train for this set T_i a single-layer neural network using the algorithm proposed in Fontenla-Romero et al. (2002). For each pair (d(t), x(t)) ∈ T_i, use x(t) as input and d(t) as desired output.
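The sketch below follows the steps of Algorithm 1 under simplifying assumptions: X and d are the embedded patterns and desired outputs of step 1, nodes are the prototypes returned by the clustering method of step 2, and ordinary linear least squares stands in for the one-layer training algorithm of Fontenla-Romero et al. (2002); the function and variable names are illustrative.

import numpy as np


def train_dle(X, d, nodes, g=2, L=2):
    owner = np.argmin(((X[:, None, :] - nodes[None]) ** 2).sum(-1), axis=1)   # step 3
    keep = [k for k in range(len(nodes)) if np.sum(owner == k) >= g]          # step 4
    nodes = nodes[keep]
    owner = np.argmin(((X[:, None, :] - nodes[None]) ** 2).sum(-1), axis=1)   # step 5
    dists = ((nodes[:, None, :] - nodes[None]) ** 2).sum(-1)
    experts = []
    for k in range(len(nodes)):                                               # step 6
        neigh = np.argsort(dists[k])[:L + 1]      # node k and its L closest neighbours
        mask = np.isin(owner, neigh)              # training set T_k (overlapping areas)
        A = np.hstack([X[mask], np.ones((mask.sum(), 1))])
        sol, *_ = np.linalg.lstsq(A, d[mask], rcond=None)                     # step 6b
        experts.append(sol)
    return nodes, experts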

Once the training of the DLE is completed, the system works as follows:

1. New values of the time series are measured to generate a new pattern x in the reconstruction space. This pattern must have the same structure as the patterns used in training.

2. Using the distance measure, the closest node to the pattern, c_i, is selected. In this process, the desired output is not used, since it is unknown.

3. The single-layer neural network associated with the closest node c_i generates the prediction of the system using x as input.

Fig. 4 depicts this process; steps 1–3 correspond, respectively, to the points explained above.

[Fig. 4. Prediction generation example for a new pattern.]
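Prediction then reduces to routing each new pattern to its closest node and evaluating the associated expert network; a short sketch consistent with the simplified training example above:

import numpy as np


def predict_dle(x, nodes, experts):
    k = np.argmin(((nodes - x) ** 2).sum(axis=1))   # step 2: closest node (no desired output needed)
    w, b = experts[k][:-1], experts[k][-1]
    return x @ w + b                                # step 3: output of the associated expert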

2.3. Distributed committees of local experts (DCLE)

In the last two decades, the theory for constructing ensembles of neural networks has been developed within many diverse research communities. The results obtained by these works reveal that combining, through a fusion rule, several learning methods trained on the same data achieves better results than using a single method, which generally does not extract all the information existing in the data. Among the most commonly used fusion rules, two main categories can be distinguished: fixed fusion rules and trainable fusion rules (Raudys, 2006). The former are simple combination rules such as the mean or the median of the outputs of the base experts; for the latter, several approaches have been proposed, such as support vector machines or single-layer neural networks. Using trainable fusion rules leads to more complex decision rules in the fusion stage, which obtain more accurate results when the base experts do not work perfectly across the different regions of the input space. In our case, the influence areas of the distinct expert networks are overlapped. As a result of this overlapping, for each zone there exist several expert networks that can generate a prediction for a new input pattern belonging to that zone. If the outputs of all the expert networks that tackle the same subregion are combined through a fusion rule, the global response of the system may be more accurate than considering only the output of the expert network associated with the closest node (as in the DLE case).

Thus, our second proposed architecture is composed of the three blocks of the DLE plus a fourth block, which consists of a set of committees of expert networks (one for each node of the clustering method). Each committee of expert networks is composed of neighbor local models whose areas of influence are overlapped, and it has an associated weighting network. Each weighting network takes as input the outputs of the expert networks that belong to the committee. For an input pattern, its closest node is selected. The global response of the system is a weighted combination of the outputs of the expert networks that belong to the committee associated with that closest node. Fig. 5 shows how the final output of the system is a combination of the outputs of the expert networks that belong to the active committee.

Algorithm 2 details the training process of this model. The inputs are the same as for the DLE algorithm and steps 1–5 were already explained for the previous DLE model. In step 6 a committee of expert networks is constructed for each subregion. Each node c_i of C adds itself and its L nearest neighbors to the set TrainingNodes_i. The patterns owned by the nodes c_j ∈ TrainingNodes_i will be used by each node to train its associated local model or expert network. Once TrainingNodes_i is constructed for all nodes, c_i is added to the committees of the nodes c_j ∈ TrainingNodes_i. Thus, each committee is formed by all the local models trained for the subregion of the input space owned by its associated node. The weighting network of each committee will take the outputs of its member expert networks as inputs. For the weighting networks, single-layer neural networks trained with the algorithm proposed in Fontenla-Romero et al. (2002) were used. In order to obtain accurate results with this algorithm, each node needs to own at least card(committee_i) + 1 patterns, where card(committee_i) is the number of nodes that belong to the committee linked to that node (if this condition is not fulfilled, the solution achieved by solving the system of card(committee_i) + 1 linear equations would not be accurate). So, if there are not enough patterns in the subregion of a node c_i to train a weighting network, in step 7 the number of expert networks that belong to the committee is decreased, erasing those associated with the nodes furthest from c_i. If a node c_j is erased from a committee, the node c_i associated with that committee is also erased from TrainingNodes_j. This ensures that the expert network associated with the node c_j will not use the patterns of c_i to train.

Once the committees are formed, the expert networks associated with each node are trained in step 8. Each local model takes as training set the patterns that are owned by the nodes in the set TrainingNodes of its associated node. Once the local models are trained, in step 9 a weighting network is adjusted for each committee. To do this, for each committee_i: the set of patterns E_i owned by its associated node c_i is taken, the outputs of the local models that belong to committee_i are generated for the set E_i, and the weighting network associated with committee_i is trained taking the outputs of the local models as inputs and the desired outputs of the patterns in E_i as desired outputs.

The complexity of the DCLE algorithm is O(max(M|S|, C_C)), as in the DLE case.
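The core of the fourth block is the trainable fusion step (step 9 of Algorithm 2 below): the outputs of the committee members on the patterns owned by a node become the inputs of a small weighting model. A sketch, again using ordinary least squares as a stand-in for the one-layer training algorithm:

import numpy as np


def fit_weighting_network(expert_preds, d):
    # expert_preds: (n_patterns, n_experts) outputs of the committee members on E_i
    # d: desired outputs of those patterns
    A = np.hstack([expert_preds, np.ones((len(expert_preds), 1))])
    sol, *_ = np.linalg.lstsq(A, d, rcond=None)
    return sol                                    # weights over the experts plus a bias


def fuse(expert_preds, weighting):
    return expert_preds @ weighting[:-1] + weighting[-1]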

[Fig. 5. Distinct expert networks outputs are combined to generate a more accurate final output.]

Algorithm 2. Distributed Committees of Local Experts trainingalgorithm

(Inputs and steps 1–5 are the same as in Algorithm 1.)
6. For each non-deleted node c_i of the clustering method C:
   (a) Add c_i and its L closest nodes to the list TrainingNodes_i.
   (b) For all the nodes c_j ∈ TrainingNodes_i, add c_i to committee_j.
7. For each non-deleted node c_i of the clustering method C:
   (a) Calculate the number of patterns l_i that are owned by c_i.
   (b) If l_i < card(committee_i) + 1:
      (i) Reduce the number of nodes included in committee_i, keeping c_i and its l_i − 2 closest nodes belonging to committee_i.
      (ii) For the nodes c_j deleted from committee_i in the previous step, erase c_i from TrainingNodes_j.
8. For each non-deleted node c_i of the clustering method C:
   (a) Construct a training set T_i with the patterns of S that are owned by the nodes c_j ∈ TrainingNodes_i.
   (b) Train a one-layer neural network with the training algorithm proposed in Fontenla-Romero et al. (2002) for the training set T_i.
9. For each non-deleted node c_i of the clustering method C:
   (a) Construct a training set E_i with the patterns of S owned by c_i.
   (b) Train, for the node c_i, a weighting network using:
      – the outputs of all {expert network_j, c_j ∈ committee_i} for the patterns in E_i as inputs;
      – the desired outputs of the patterns in E_i as desired outputs.
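A sketch of the committee bookkeeping of steps 6 and 7 follows, under the same simplifying assumptions as the earlier examples (nodes are the cluster prototypes and owner maps each pattern to its closest node); the helper names are illustrative.

import numpy as np


def build_committees(nodes, owner, L):
    dists = ((nodes[:, None, :] - nodes[None]) ** 2).sum(-1)
    # step 6a: each node trains on its own patterns and those of its L closest nodes
    training_nodes = [list(np.argsort(dists[i])[:L + 1]) for i in range(len(nodes))]
    committees = [[] for _ in nodes]
    for i, tn in enumerate(training_nodes):       # step 6b
        for j in tn:
            committees[j].append(i)
    for i in range(len(nodes)):                   # step 7
        l_i = int(np.sum(owner == i))
        if l_i < len(committees[i]) + 1:
            by_dist = sorted(committees[i], key=lambda j: dists[i, j])
            kept = by_dist[:max(l_i - 1, 1)]      # keep c_i and its closest members
            for j in committees[i]:
                if j not in kept:
                    training_nodes[j].remove(i)   # step 7b(ii)
            committees[i] = kept
    return training_nodes, committees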

Once the model is trained, it works as follows (graphically depicted in Fig. 6):

1. New values of the time series are measured to generate a new pattern x in the reconstruction space.

2. The closest node c_i to x is selected. In this process only x is used, since the desired output is unknown.

3. All the expert networks that belong to the committee linked to the closest node c_i are activated and generate their predictions using x as input.

4. The weighting network associated with c_i is activated and, taking as input the values generated in the previous step by the expert networks, generates the final output of the system.
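Put together, DCLE prediction can be sketched as follows, reusing the simplified linear experts and weighting vectors of the previous examples (assumptions, not the authors' code):

import numpy as np


def predict_dcle(x, nodes, committees, experts, weightings):
    k = np.argmin(((nodes - x) ** 2).sum(axis=1))              # step 2: closest node
    preds = np.array([x @ experts[j][:-1] + experts[j][-1]     # step 3: committee outputs
                      for j in committees[k]])
    w = weightings[k]
    return preds @ w[:-1] + w[-1]                              # step 4: weighted fusion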

3. Experimental results

The proposed methods were empirically compared with two other local models: (a) a local modeling approach proposed in Fontenla-Romero, Alonso-Betanzos, Castillo, Principe, and Guijarro-Berdiñas (2002), which in this work is called Distributed Local Models (DLM); (b) the classical mixture of experts model developed by Jacobs and Jordan (1993), adapted to regression problems by Weigend, Mangeas, and Srivastava (1995); and also (c) two commonly used global modeling approaches: the Tapped Delay Line Multilayer Perceptron (TDL-MLP) trained with the Scaled Conjugate Gradient Algorithm (Møller, 1993) and ε-Support Vector Regression (Smola & Schölkopf, 1998). All methods were run in Matlab 7.0 (R14). Specifically, the following implementations were used: for ε-SVR, the Spider Toolbox 1.71 (Weston, Elisseeff, BakIr, & Sinz, 2006); for TDL-MLP, the implementation of the Matlab Neural Networks Toolbox; for mixture of experts, an implementation by Perry Moerland (Mitchell, 2007); and for DLM, DLE and DCLE, our own implementation (the Matlab source code of the proposed methods is available at http://www.dc.fi.udc.es/lidia/downloads/TSLM). In order to make the comparisons, three benchmark time series were employed: Henon (Hénon, 1976; Wan, 2005), Lorenz (Lorenz, 1963; Wan, 2005) and Dow–Jones (Ley, 1996). For all of them the goal is to predict the next future value using five previous values. We used two different sizes, medium and large, for the datasets in order to analyze their influence on the performance of the methods. The numbers of samples were 1500 (medium) and 15,000 (large) for Henon; 3000 and 30,000 for Lorenz; and 1000 and 10,000 for Dow–Jones. The output neurons of the TDL-MLP, the DLE and the DCLE used logarithmic sigmoidal functions. For this reason, the time series were normalized to the interval [0.05, 0.95].

[Fig. 6. Execution of the Distributed Committees of Local Experts model.]

Table 1
Mean and standard deviation of the Normalized Mean Squared Error (NMSE) and of the training time (s) for the Henon time series.

         Method             NMSE ± STD                           Train. T. ± STD
Medium   SVR                2.01 × 10^-2 ± 2.40 × 10^-3          0.06 ± 0.01
         TDL-MLP            7.59 × 10^-2 ± 5.09 × 10^-1          5.45 ± 0.56
         DLM                1.01 × 10^-1 ± 5.72 × 10^-2          9.95 ± 0.12
         MIX. of EXP.       2.90 × 10^-3 ± 1.30 × 10^-3          29.38 ± 1.22
         DLE-SOM            3.10 × 10^-2 ± 1.18 × 10^-2          9.66 ± 0.56
         DLE-GNG            1.72 × 10^-2 ± 5.20 × 10^-3          15.47 ± 8.84
         DLE-VQIT           1.92 × 10^-2 ± 7.20 × 10^-3          6.86 ± 0.12
         DCLE-SOM           1.22 × 10^-4 ± 3.10 × 10^-4          9.76 ± 0.57
         DCLE-GNG           3.08 × 10^-5 ± 7.22 × 10^-5          15.09 ± 7.16
         DCLE-VQIT          3.50 × 10^-3 ± 7.70 × 10^-3          8.02 ± 0.47
Large    SVR                1.90 × 10^-2 ± 6.90 × 10^-4          0.50 ± 0.02
         TDL-MLP            7.18 × 10^-2 ± 5.00 × 10^-1          231.10 ± 33.70
         DLM                5.43 × 10^-4 ± 5.65 × 10^-4          26.13 ± 0.25
         MIX. of EXP. (IX)  5.40 × 10^-2 ± 1.99 × 10^-2          601.03 ± 80.27
         DLE-SOM            4.43 × 10^-5 ± 1.03 × 10^-5          32.08 ± 0.49
         DLE-GNG            2.40 × 10^-3 ± 3.80 × 10^-3          33.72 ± 26.27
         DLE-VQIT           2.36 × 10^-5 ± 7.60 × 10^-6          88.50 ± 0.46
         DCLE-SOM           2.76 × 10^-7 ± 1.58 × 10^-7          33.96 ± 0.57
         DCLE-GNG           1.26 × 10^-6 ± 2.55 × 10^-6          34.37 ± 27.15
         DCLE-VQIT          1.86 × 10^-5 ± 3.33 × 10^-5          95.61 ± 1.14

Since it might be argued that the results obtained were the consequence of particular combinations of training parameters, the following different configurations were tried for each method:

• For ε-SVR, the method described in Cherkassky and Ma (2002) was used to select the variance σ of the RBF kernel and the soft margin C. The ε value of the ε-insensitive function was varied in the set {0.1, 0.05, 0.025, 0.0125}.
• For TDL-MLP, topologies with one hidden layer and logistic sigmoidal activation functions were used. The number of neurons in the hidden layer was varied between 5 and 50.
• For DLM, a Self-Organizing Map (SOM) with a 20 × 20 topology trained for 2000 epochs was employed in the large-dataset case, and a 7 × 7 topology trained for the same number of epochs was used in the medium-dataset case. The number of neighbors was varied between 7 and 27 for the large datasets and between 7 and 17 for the medium datasets.
• For mixture of experts, single-layer neural networks trained with the Scaled Conjugate Gradient Algorithm (Møller, 1993) were used as base models. A Gaussian mixture model was selected for the gate, and the number of base models was varied between 5 and 30.
• For the DLE and the DCLE, the value of g was set to 2 and the number of neighbors L was varied between 7 and 27 in the large-dataset case and between 7 and 17 in the medium-dataset case. For the large datasets, a 20 × 20 topology for the SOM and 400 nodes for the VQIT and the GNG were used. For the medium datasets, a 7 × 7 topology for the SOM and 50 nodes for the VQIT and the GNG were used.

The configurations that obtained the best performance among the ones tried were selected. To estimate the best configuration for each method, ten 10-fold cross-validations were run. The performance of each method was obtained by calculating, for all its configurations, the mean Normalized Mean Squared Error (NMSE) (Weigend & Gershenfeld, 1994) over the 100 trials and selecting the configuration with the lowest mean error. These errors are included in Tables 1–3 for the Henon, Lorenz and Dow–Jones time series, respectively.
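For reference, a common definition of the NMSE (assumed here, following Weigend & Gershenfeld, 1994) normalizes the mean squared error by the variance of the target values:

import numpy as np


def nmse(d_true, d_pred):
    return np.sum((d_true - d_pred) ** 2) / np.sum((d_true - np.mean(d_true)) ** 2)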

Furthermore, in order to evaluate the efficiency of the methods, the CPU time was measured using a personal computer with a 2.13 GHz Intel processor and 2 GB of main memory. The memory requirements of the mixture of experts method in some of the large-dataset cases were much higher than the requirements of the rest of the methods. Due to this fact, in these cases it was necessary to run the trials of the mixture of experts method on a machine with an Intel Xeon 2.66 GHz processor and 8 GB of main memory.

The mean CPU times (in s) obtained for the three datasets by all the different methods are also included in Tables 1–3. The trials that needed to be run on the Intel Xeon computer are accompanied by an IX tag. Figs. 7–9 are shown for an easier comparison among the performance of the methods regarding error and time. As can be seen in Fig. 7, the proposed methods present the best balance among the two measurements and the two sizes of the dataset, being able to obtain lower error rates in both the medium and the large datasets with an adequate time consumption.

[Fig. 7. Comparisons in time and NMSE for the Henon series. The time of the mixture of experts model is much bigger than the limit of the graph, but it has been reduced in order to be able to display the time consumption of the other methods.]

Table 2
Mean and standard deviation of the Normalized Mean Squared Error (NMSE) and of the training time (s) for the Lorenz time series.

         Method             NMSE ± STD                           Train. T. ± STD
Medium   SVR                1.46 × 10^-2 ± 3.10 × 10^-3          0.025 ± 0.01
         TDL-MLP            1.56 × 10^-4 ± 8.82 × 10^-5          13.29 ± 2.74
         DLM                3.34 × 10^-4 ± 1.54 × 10^-4          10.05 ± 0.11
         MIX. of EXP.       3.02 × 10^-5 ± 8.19 × 10^-6          58.92 ± 2.61
         DLE-SOM            1.34 × 10^-4 ± 7.11 × 10^-5          10.16 ± 2.24
         DLE-GNG            9.80 × 10^-5 ± 4.70 × 10^-5          41.93 ± 11.08
         DLE-VQIT           2.58 × 10^-4 ± 9.95 × 10^-5          3.27 ± 0.03
         DCLE-SOM           2.11 × 10^-5 ± 3.64 × 10^-5          10.43 ± 2.34
         DCLE-GNG           5.24 × 10^-5 ± 7.40 × 10^-5          63.54 ± 15.23
         DCLE-VQIT          4.19 × 10^-4 ± 3.01 × 10^-4          8.79 ± 1.93
Large    SVR                1.13 × 10^-2 ± 7.08 × 10^-4          0.23 ± 0.03
         TDL-MLP            9.27 × 10^-5 ± 5.67 × 10^-5          197.45 ± 20.48
         DLM                9.52 × 10^-6 ± 4.11 × 10^-6          27.59 ± 0.44
         MIX. of EXP. (IX)  4.14 × 10^-6 ± 3.31 × 10^-7          1929.0 ± 200.7
         DLE-SOM            5.17 × 10^-6 ± 1.52 × 10^-6          56.26 ± 0.78
         DLE-GNG            5.22 × 10^-6 ± 3.83 × 10^-6          54.47 ± 9.25
         DLE-VQIT           5.59 × 10^-6 ± 1.79 × 10^-6          115.69 ± 0.81
         DCLE-SOM           1.63 × 10^-6 ± 2.33 × 10^-6          59.67 ± 0.90
         DCLE-GNG           2.25 × 10^-6 ± 2.59 × 10^-6          53.37 ± 7.58
         DCLE-VQIT          8.13 × 10^-6 ± 7.72 × 10^-6          124.34 ± 1.60

Table 3
Mean and standard deviation of the Normalized Mean Squared Error (NMSE) and of the training time (s) for the Dow–Jones time series.

         Method             NMSE ± STD                           Train. T. ± STD
Medium   SVR                6.70 × 10^-3 ± 2.20 × 10^-3          0.02 ± 0.01
         TDL-MLP            2.53 × 10^-2 ± 2.42 × 10^-1          5.83 ± 0.17
         DLM                7.60 × 10^-3 ± 1.17 × 10^-2          9.94 ± 0.16
         MIX. of EXP.       1.10 × 10^-3 ± 5.88 × 10^-4          4.37 ± 0.04
         DLE-SOM            1.10 × 10^-3 ± 5.16 × 10^-4          9.97 ± 0.19
         DLE-GNG            1.10 × 10^-3 ± 5.45 × 10^-4          10.17 ± 5.18
         DLE-VQIT           1.10 × 10^-3 ± 5.92 × 10^-4          8.79 ± 0.53
         DCLE-SOM           5.80 × 10^-3 ± 7.70 × 10^-3          10.03 ± 0.19
         DCLE-GNG           5.30 × 10^-3 ± 4.50 × 10^-3          10.61 ± 2.49
         DCLE-VQIT          5.20 × 10^-3 ± 9.90 × 10^-3          8.12 ± 0.10
Large    SVR                1.35 × 10^-2 ± 3.90 × 10^-3          0.06 ± 0.01
         TDL-MLP            3.39 × 10^-2 ± 3.35 × 10^-1          111.91 ± 21.17
         DLM                1.00 × 10^-3 ± 6.61 × 10^-4          25.49 ± 0.24
         MIX. of EXP.       4.13 × 10^-4 ± 2.09 × 10^-4          186.83 ± 4.42
         DLE-SOM            4.33 × 10^-4 ± 2.11 × 10^-4          28.33 ± 0.46
         DLE-GNG            5.34 × 10^-4 ± 2.16 × 10^-4          6.19 ± 0.60
         DLE-VQIT           4.68 × 10^-4 ± 2.13 × 10^-4          82.13 ± 0.44
         DCLE-SOM           2.60 × 10^-3 ± 3.80 × 10^-3          29.39 ± 0.52
         DCLE-GNG           8.36 × 10^-4 ± 1.10 × 10^-3          6.15 ± 0.63
         DCLE-VQIT          2.02 × 10^-2 ± 1.50 × 10^-2          89.98 ± 0.81

For the Lorenz time series, the results obtained are analogous. That is, in Fig. 8, it can be seen that again the smallest errors are obtained by the proposed methods, which are again the ones with better scaling, and also with an adequate balance between performance and time.

[Fig. 8. Comparisons in time and NMSE for the Lorenz series. The time of the mixture of experts model is bigger than the limit of the graph, but it has been reduced in order to be able to display the time consumption of the other methods.]

The last dataset belongs to a real problem, the Dow–Jones time series. In this case, the comparison graph in Fig. 9 shows that the results obtained by the proposed methods and the mixture of experts are equal or similar in NMSE. Regarding the time employed, the mixture of experts model needs an acceptable time for the medium-size dataset (similar to the one employed by the proposed local methods), but scales worse than them for the large-size dataset.

[Fig. 9. Comparisons in time and NMSE for the Dow–Jones series.]

A Kruskal–Wallis test (Kruskal & Wallis, 1952) was performed to find out whether the differences between the performances of all the methods were statistically significant. A level of significance of α = 0.05 was selected. The contrast concluded that, for all the datasets used, there were significant differences between the performances obtained by the different methods. To establish which was the best model for each dataset, a multiple comparison using a one-way ANOVA for each possible pair of models was performed. The results of the multiple comparisons are detailed in Table 4. For each dataset, the different models are organized as follows: (best) the model with the best performance; (equal) the models whose results, despite being worse, are not significantly different from the best one; and (distinct) the models whose results are significantly worse than the best one.
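A sketch of this significance-testing step, assuming the per-trial NMSE values of each method are collected in arrays (scipy.stats.kruskal implements the Kruskal–Wallis test; the pairwise follow-up comparisons are not reproduced here):

from scipy import stats


def significantly_different(errors_by_method, alpha=0.05):
    # errors_by_method: dict mapping method name -> array of NMSE values over the trials
    stat, p = stats.kruskal(*errors_by_method.values())
    return p < alpha   # True if at least one method differs significantly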

4. Discussion and future research

In this paper, two local modeling approaches for tackling the time series prediction problem were presented. The first proposed architecture (DLE) is a pure selection model that splits the whole original problem into several subproblems and finds a global solution by solving all these generated subproblems. The second proposed architecture (DCLE) is a hybrid selection–fusion model that in its last stage combines the outputs of neighbor local models with a trainable fusion rule, obtaining a more accurate final output. As supported by the experimental results obtained, the following facts can be concluded:

• The proposed local architectures exhibit, for the data sets employed, a better performance than the previous classical approaches. This difference increases when the size of the dataset becomes greater.
• Both DLE and DCLE can be combined with any clustering algorithm (a general one or one specifically designed for time series data) in their first stage. Although there are differences in performance depending on the clustering algorithm used, the experimental results obtained suggest that, generally, they are not significant.
• The training times and computational requirements demanded by the presented architectures are lower than those required by the classical approaches. It is especially remarkable that, in this paper, more efficient ensembles of local models have been obtained. This fact appears clearly when comparing the training times and computational requirements of a classical mixture of experts model with those of the proposed architectures. This difference also becomes greater when the number of input patterns increases.
• The proposed models obtained the best-balanced results between dataset size and time employed in all the time series tested. In particular, DCLE-GNG is the combination that obtains the best results.

Although the results obtained in this work show that the proposed DLE and DCLE methods obtain a good performance in an efficient way, there are still some aspects that could be improved. In this regard, we are working on changing the static scheme of considering the L nearest neighbors to construct the training sets of the expert networks and the committees in both DLE and DCLE. Instead, a dynamic approach could be taken, grouping the nodes following inter- and intra-cluster density measures such as the ones proposed in Tasdemir and Merenyi (2007). Thanks to this change, grouping unrelated nodes can be avoided, hopefully leading to more accurate expert networks and committees of networks.

Table 4
Results of the statistical multiple comparison between the performances of all methods.

Medium datasets:
  Henon: Best: DCLE-GNG. Equal: DCLE-SOM, DCLE-VQIT, MIX. of EXP. Distinct: TDL-MLP, SVR, DLM, DLE-SOM, DLE-GNG, DLE-VQIT.
  Lorenz: Best: DCLE-SOM. Equal: DCLE-GNG, MIX. of EXP., DLE-GNG, DLE-SOM. Distinct: TDL-MLP, SVR, DLM, DLE-VQIT, DCLE-VQIT.
  Dow–Jones: Best: DLE-SOM. Equal: DLE-GNG, DLE-VQIT, MIX. of EXP., TDL-MLP. Distinct: SVR, DLM, DCLE-SOM, DCLE-GNG, DCLE-VQIT.

Large datasets:
  Henon: Best: DCLE-SOM. Equal: DLE-SOM, DLE-VQIT, DCLE-GNG, DCLE-VQIT. Distinct: TDL-MLP, SVR, DLM, MIX. of EXP., DLE-GNG.
  Lorenz: Best: DCLE-SOM. Equal: MIX. of EXP., DLE-SOM, DLE-GNG, DLE-VQIT, DCLE-GNG. Distinct: TDL-MLP, SVR, DLM, DCLE-VQIT.
  Dow–Jones: Best: MIX. of EXP. Equal: TDL-MLP, DLE-SOM, DLE-GNG, DLE-VQIT. Distinct: SVR, DLM, DCLE-SOM, DCLE-GNG, DCLE-VQIT.

Acknowledgements

The authors thank Indra Sistemas, S.A. for their support of this work under project 08DPI145E of the Xunta de Galicia. This work was also supported in part by the Spanish Ministry of Education and Science (Grant code TIN2006-02402) and by program 2007/000134-0 of the Xunta de Galicia, both partially supported by FEDER funds. D. Martínez-Rego acknowledges the support of the Spanish Ministry of Science and Innovation under the F.P.U. Grants Programme. Finally, the authors also want to acknowledge the Supercomputing Center of Galicia (CESGA) for the opportunity of using their computing services.

References

Box, G., Jenkins, G. M., & Reinsel, G. C. (1994). Time series analysis: Forecasting and control. Prentice-Hall.
Cherkassky, V., & Ma, Y. (2002). Practical selection of SVM parameters and noise estimation for SVM regression. Neural Computation, 17, 113–126.
Di Giacinto, V. (2006). A generalized space–time ARMA model with an application to regional unemployment analysis in Italy. International Regional Science Review, 29(2), 159–198.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Fontenla-Romero, O., Alonso-Betanzos, A., Castillo, E., & Guijarro-Berdiñas, B. (2002). A global optimum approach for one-layer neural networks. Lecture Notes in Computer Science, 2415, 1429–1449.
Fontenla-Romero, O., Alonso-Betanzos, A., Castillo, E., Principe, J. C., & Guijarro-Berdiñas, B. (2002). Local modeling using self-organizing maps and single layer neural networks. Lecture Notes in Computer Science, 2415, 945–950.
Fritzke, B. (1995). A growing neural gas network learns topologies. Advances in Neural Information Processing Systems, 7, 625–632.
Haykin, S. (1999). Neural networks: A comprehensive foundation. Prentice-Hall.
Hénon, M. (1976). A two-dimensional mapping with a strange attractor. Communications in Mathematical Physics, 50(1), 69–77.
Jacobs, R. A., & Jordan, M. I. (1993). Hierarchical mixtures of experts and the EM algorithm. International Joint Conference on Neural Networks, 1339–1344.
Jordan, M. I. (1986). Serial order: A parallel distributed processing approach. Institute for Cognitive Science Report 8604, University of California.
Kohonen, T. (2001). Self-organizing maps. Springer-Verlag.
Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47, 583–621.
Kuncheva, L. I. (2002). A theoretical study on six classifier fusion strategies. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2), 281–286.
Kuncheva, L. I. (2004). Combining pattern classifiers: Methods and algorithms. John Wiley & Sons.
Ley, E. (1996). On the peculiar distribution of the US stock indices first digits. The American Statistician, 50(4), 311–314.
Lorenz, E. N. (1963). Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20, 130–143.
Mitchell, H. B. (2007). Multi-sensor data fusion. Berlin, Heidelberg: Springer.
Møller, M. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6, 525–533.
Nowlan, S. J., Hinton, G. E., Jacobs, R. A., & Jordan, M. I. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.
Olafsson, S., & Sigbjornsson, R. (1996). Application of ARMA models to estimate earthquake ground motion and structural response. International Journal of Rock Mechanics and Mining Sciences and Geomechanics Abstracts, 33(3), 951–966.
Principe, J. C., Lehn-Schioler, T., Hedge, A., & Erdogmus, D. (2005). Vector-quantization using information theoretic concepts. Natural Computing, 4, 39–51.
Raudys, S. (2006). Trainable fusion rules I. Large sample size case. Neural Networks, 19, 1506–1516.
Smola, A. J., & Schölkopf, B. (1998). A tutorial on support vector regression. NeuroCOLT.
Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9(3), 293–300.
Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., & Vandewalle, J. (2002). Least squares support vector machines. World Scientific Pub. Co.
Tasdemir, K., & Merenyi, E. (2007). A new cluster validity index for prototype based clustering algorithms based on inter- and intra-cluster density. Proceedings of the International Joint Conference on Neural Networks.
Van Veelen, M., Nijhuis, J., & Spaanenburg, B. (2000). Neural network approaches to capture temporal information. Computing Anticipatory Systems – Third International Conference. AIP Conference Proceedings, 517, 361–371.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. (1988). Phoneme recognition using time-delay neural networks. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 107–110.
Wan, E. A. (2005). Time series data. OGI School of Science and Engineering. URL http://www.cse.ogi.edu/~ericwan/data.html. Last accessed 2.11.2007.
Weigend, A. S., & Gershenfeld, N. A. (1994). Time series prediction: Forecasting the future and understanding the past. Addison-Wesley.
Weigend, A. S., Mangeas, M., & Srivastava, A. N. (1995). Nonlinear gated experts for time series: Discovering regimes and avoiding overfitting. International Journal of Neural Systems, 6, 373–379.
Weston, J., Elisseeff, A., BakIr, G., & Sinz, F. (2006). Spider SVM Toolbox. Max Planck Institute for Biological Cybernetics. URL http://www.kyb.tuebingen.mpg.de/bs/people/spider/. Last accessed 2.11.2007.
Wolpert, D. (1992). Stacked generalization. Neural Networks, 5, 241–259.