Time Series Forecasting With Feed-Forward Neural Networks: Guidelines and Limitations

Eric Plummer

Computer Science Department

University of Wyoming

April 8, 2023

Topics

• Thesis Goals
• Time Series Forecasting
• Neural Networks
• K-Nearest-Neighbor
• Test-Bed Application
• Empirical Evaluation
• Data Preprocessing
• Contributions
• Future Work
• Conclusion
• Demonstration


Thesis Goals

• Compare neural networks and k-nearest-neighbor for time series forecasting

• Analyze the response of various configurations to data series with specific characteristics

• Identify when neural networks and k-nearest-neighbor are inadequate

• Evaluate the effectiveness of data preprocessing


Time Series Forecasting – Description

• What is it?
  – Given an existing data series, observe or model the data series to make accurate forecasts
• Example data series
  – Financial (e.g., stocks, rates)
  – Physically observed (e.g., weather, sunspots)
  – Mathematical (e.g., Fibonacci sequence)


Time Series Forecasting – Difficulties

• Why is it difficult?
  – Limited quantity of data
    • Observed data series sometimes too short to partition
  – Noise
    • Erroneous data points
    • Obscuring component
      – Moving average
  – Nonstationarity
    • Fundamentals change over time
    • Nonstationary mean: “ascending” data series
      – First-difference preprocessing
  – Forecasting method selection
    • Statistics
    • Artificial intelligence


Time Series Forecasting – Importance

• Why is it important?
  – Preventing undesirable events by forecasting the event, identifying the circumstances preceding the event, and taking corrective action so the event can be avoided (e.g., inflationary economic period)
  – Forecasting undesirable, yet unavoidable, events to preemptively lessen their impact (e.g., solar maximum with sunspots)
  – Profiting from forecasting (e.g., financial markets)


Neural Networks – Background

• Loosely based on the human brain’s neuron structure
• Timeline
  – 1940s – McCulloch and Pitts – proposed neuron models in the form of binary threshold devices and stochastic algorithms
  – 1950s & 1960s – Rosenblatt – class of learning machines called perceptrons
  – Late 1960s – Minsky and Papert – discouraging analysis of perceptrons (limited to linearly separable classes)
  – 1980s – Rumelhart, Hinton, and Williams – generalized delta rule for learning by back-propagation for training multilayer perceptrons
  – Present – many new training algorithms and architectures, but nothing “revolutionary”


Neural Networks – Architecture

• A feed-forward neural network can have any number of:
  – Layers
  – Units per layer
  – Network inputs
  – Network outputs
• Hidden layers (A, B)
• Output layer (C)


Neural Networks – Units

• A unit has:
  – Connections
  – Weights
  – Bias
  – Activation function
• Weights and bias are randomly initialized before training
• Unit’s input consists of:
  – Sum of the products of each connection value and associated weight
  – Plus the bias
• Input is then fed into unit’s activation function
• Unit’s output is the output of activation function
  – Hidden layers: Sigmoid
  – Output layer: Linear
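The unit computation described above can be sketched in a few lines of Python. This is only an illustration: the function names (`sigmoid`, `linear`, `unit_output`) and the example numbers are assumptions made here, not code from FORECASTER.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, the hidden-layer activation function."""
    return 1.0 / (1.0 + math.exp(-x))

def linear(x):
    """Identity activation, used by output-layer units."""
    return x

def unit_output(inputs, weights, bias, activation):
    """Sum of connection value times weight, plus the bias, fed through the activation."""
    net = sum(i * w for i, w in zip(inputs, weights)) + bias
    return activation(net)

# A hidden unit whose weighted sum is exactly 0, feeding one output unit.
hidden = unit_output([1.0, 2.0], [0.5, -0.25], 0.0, sigmoid)
output = unit_output([hidden], [2.0], 1.0, linear)
```

Here the hidden unit's input is 0, so its sigmoid output is 0.5, and the linear output unit returns 0.5 × 2 + 1 = 2.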


Neural Networks – Training

• Partition data series into:
  – Training set
  – Validation set (optional)
  – Test set (optional)
• Typically, the training procedure is:
  – Perform backpropagation training with training set
  – After n epochs, compute total squared error on training set and validation set
  – If the validation error consistently rises while the training error falls, stop training
• Overfitting: training set learned too well
• Generalization: given inputs not in training and validation sets, able to accurately forecast



• Backpropagation training:
  – First, examples in the form of <input, output> pairs are extracted from the data series
  – Then, the network is trained with backpropagation on the examples:
    1. Present an example’s input vector to the network inputs and run the network sequentially forward
    2. Propagate the error sequentially backward from the output layer
    3. For every connection, change the weight modifying that connection in proportion to the error
  – When all three steps have been performed for all examples, one epoch has occurred
  – Goal is to converge to a near-optimal solution based on the total squared error
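The three steps can be sketched as follows for a network with one sigmoid hidden layer and a single linear output unit. This is a minimal sketch under stated assumptions, not FORECASTER's implementation: the dictionary layout, the names `make_network`, `forward`, and `train_epoch`, the initialization range, the learning rate, and the toy examples are all hypothetical.

```python
import math
import random

def make_network(n_inputs, n_hidden, rng):
    """Randomly initialize an n_inputs:n_hidden:1 network (range is an assumption)."""
    def small():
        return rng.uniform(-0.5, 0.5)
    return {
        "w_hid": [[small() for _ in range(n_inputs)] for _ in range(n_hidden)],
        "b_hid": [small() for _ in range(n_hidden)],
        "w_out": [small() for _ in range(n_hidden)],
        "b_out": small(),
    }

def forward(net, x):
    """Step 1: run the network forward (sigmoid hidden units, linear output)."""
    hid = [1.0 / (1.0 + math.exp(-(sum(xi * wi for xi, wi in zip(x, w)) + b)))
           for w, b in zip(net["w_hid"], net["b_hid"])]
    out = sum(h * w for h, w in zip(hid, net["w_out"])) + net["b_out"]
    return hid, out

def train_epoch(net, examples, rate):
    """Present every example once; return the total squared error for the epoch."""
    total = 0.0
    for x, desired in examples:
        hid, out = forward(net, x)
        # Step 2: propagate the error backward from the output layer.
        d_out = desired - out                    # linear unit: derivative is 1
        d_hid = [h * (1.0 - h) * d_out * w for h, w in zip(hid, net["w_out"])]
        # Step 3: change each weight in proportion to its error term.
        for j, h in enumerate(hid):
            net["w_out"][j] += rate * d_out * h
        net["b_out"] += rate * d_out
        for j, d in enumerate(d_hid):
            for k, xk in enumerate(x):
                net["w_hid"][j][k] += rate * d * xk
            net["b_hid"][j] += rate * d
        total += 0.5 * d_out ** 2
    return total

# Toy <input, output> pairs: the output is the sum of the two inputs.
examples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 2.0)]
net = make_network(2, 3, random.Random(0))
first_error = train_epoch(net, examples, 0.1)
for _ in range(200):
    last_error = train_epoch(net, examples, 0.1)
```

Weights are updated after each example here; accumulating the changes over a whole epoch before applying them is an equally common variant.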



Backpropagation training cycle


Neural Networks – Forecasting

• Forecasting method depends on examples
• Examples depend on step-ahead size
  – If step-ahead size is one: iterative forecasting
  – If step-ahead size is greater than one: direct forecasting

Iterative forecasting: each forecast is fed back in as an input, so this can continue indefinitely

Directly forecasting n steps: this is the only forecast
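How the examples and the two forecasting modes relate can be sketched as below. The function names and the stand-in `model` callables are hypothetical; in practice `model` would be a trained network mapping an input window to one value.

```python
def make_examples(series, n_inputs, step_ahead=1):
    """Slide a window over the series to build <input, output> pairs."""
    examples = []
    last_start = len(series) - n_inputs - step_ahead
    for start in range(last_start + 1):
        window = series[start:start + n_inputs]
        target = series[start + n_inputs + step_ahead - 1]
        examples.append((window, target))
    return examples

def iterative_forecast(model, series, n_inputs, n_steps):
    """One-step-ahead model: feed each forecast back in as the newest input."""
    window = list(series[-n_inputs:])
    forecasts = []
    for _ in range(n_steps):
        value = model(window)
        forecasts.append(value)
        window = window[1:] + [value]
    return forecasts

def direct_forecast(model, series, n_inputs):
    """n-step-ahead model: a single forecast from the last observed window."""
    return model(list(series[-n_inputs:]))

series = [1, 2, 3, 4, 5]
one_step = make_examples(series, 3)                  # step-ahead size of one
two_step = make_examples(series, 3, step_ahead=2)    # step-ahead size of two

# Stand-ins for trained networks: plain linear extrapolation.
next_value = lambda w: w[-1] + (w[-1] - w[-2])
two_ahead = lambda w: w[-1] + 2 * (w[-1] - w[-2])

iterated = iterative_forecast(next_value, series, 3, 3)
direct = direct_forecast(two_ahead, series, 3)
```

With a step-ahead size of one the same model is reused for every forecast; with a larger step-ahead size the model yields exactly one value, the one that lies that many steps past the end of the series.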


K-Nearest-Neighbor – Forecasting

• No model to train
• Simple linear search
• Compare reference to candidates
• Select k candidates with lowest error
• Forecast is average of k next values
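The search above can be sketched as follows. Two details are assumptions made for this sketch rather than facts from the slides: the reference is taken to be the most recent window of the series, and the error between reference and candidate is the sum of squared differences. The name `knn_forecast` is hypothetical.

```python
def knn_forecast(series, window_size, k):
    """Forecast the next value as the average of the values that followed
    the k candidate windows most similar to the reference window."""
    reference = series[-window_size:]
    scored = []
    # Linear search; every candidate must be followed by a known value.
    for start in range(len(series) - window_size):
        candidate = series[start:start + window_size]
        error = sum((c - r) ** 2 for c, r in zip(candidate, reference))
        scored.append((error, series[start + window_size]))
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]
    return sum(next_value for _, next_value in nearest) / len(nearest)

# A sawtooth with period 4; the last window [0, 1, 2] has two exact matches.
sawtooth = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2]
forecast = knn_forecast(sawtooth, window_size=3, k=2)
```

Both exact matches were followed by 3, so the forecast is 3.0.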


Test-Bed Application – FORECASTER

• Written in Visual C++ with MFC
• Object-oriented
• Multithreaded
• Wizard-based
• Easily modified
• Implements feed-forward neural networks & k-nearest-neighbor
• Used for time series forecasting
• Eventually will be upgraded for classification problems

Empirical Evaluation – Data Series

[Figure: “Original”. Axes: Data Point (0 to 210), Value.]
[Figure: “Original with Less Noisy”. Series: Original, Less Noisy. Axes: Data Point (0 to 210), Value.]
[Figure: “Original with More Noisy”. Series: Original, More Noisy. Axes: Data Point (0 to 210), Value.]
[Figure: “Original with Ascending”. Series: Original, Ascending. Axes: Data Point (0 to 210), Value.]
[Figure: “Sunspots 1784-1983”. Axes: Year, Count.]


Empirical Evaluation – Neural Network Architectures

• Number of network inputs based on data series
• Need to make unambiguous examples
• For “sawtooths”:
  – 24 inputs are necessary
  – Test networks with 25 & 35 inputs
  – Test networks with 1 hidden layer with 2, 10, & 20 hidden layer units
  – One output layer unit
• For sunspots:
  – 30 inputs
  – 1 hidden layer with 30 units
• For real-world data series, selection may be trial-and-error!


Empirical Evaluation – Neural Network Training

• Heuristic method:
  – Start with aggressive learning rate
  – Gradually lower learning rate as validation error increases
  – Stop training when learning rate cannot be lowered anymore
• Simple method:
  – Use conservative learning rate
  – Training stops when:
    • Number of training epochs equals the epochs limit -or-
    • Training error is less than or equal to error limit
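The two stopping rules can be sketched as control loops around a training routine. Everything specific here is an assumption for illustration: the callables `train_epoch` and `validation_error`, the halving `factor`, the `min_rate` floor, and the `max_epochs` safety cap are hypothetical, and the slides do not say how much the heuristic lowers the rate at each step.

```python
def simple_training(train_epoch, rate, epochs_limit, error_limit):
    """Train at a fixed rate until the epochs limit or the error limit is reached."""
    error = float("inf")
    epochs = 0
    while epochs < epochs_limit and error > error_limit:
        error = train_epoch(rate)
        epochs += 1
    return epochs, error

def heuristic_training(train_epoch, validation_error, rate, min_rate,
                       factor=0.5, max_epochs=10000):
    """Lower the rate whenever validation error rises; stop once the rate
    drops below min_rate (or the assumed max_epochs cap is hit)."""
    best = float("inf")
    epochs = 0
    while rate >= min_rate and epochs < max_epochs:
        train_epoch(rate)
        epochs += 1
        error = validation_error()
        if error > best:
            rate *= factor      # validation error rose: be less aggressive
        else:
            best = error
    return epochs, rate

# Toy stand-in for a network: its training error halves every epoch.
state = {"error": 8.0}

def toy_epoch(rate):
    state["error"] *= 0.5
    return state["error"]

epochs, error = simple_training(toy_epoch, rate=0.05, epochs_limit=100, error_limit=1.0)
```

In the toy run the error falls 4, 2, 1, so the simple method stops after three epochs on the error limit rather than the epochs limit.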


Empirical Evaluation – Neural Network Forecasting

• Metric to compare forecasts: coefficient of determination
  – Value lies in (-∞, 1]
  – Want value between 0 and 1, where 0 is forecasting the mean of the data series and 1 is forecasting the actual value
  – Must have actual values to compare with forecasted values
• For networks trained on original, less noisy, and more noisy data series, forecast will be compared to original series
• For networks trained on ascending data series, forecast will be compared to continuation of ascending series
• For networks trained on sunspots data series, forecast will be compared to test set
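The metric can be computed as in the sketch below, which uses the standard definition of the coefficient of determination and takes the mean over the actual values being compared. The function name and the sample numbers are illustrative only.

```python
def coefficient_of_determination(actual, forecast):
    """1 minus (squared forecast error) / (squared deviation of actual from its mean)."""
    mean = sum(actual) / len(actual)
    residual = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    total = sum((a - mean) ** 2 for a in actual)
    return 1.0 - residual / total

actual = [1.0, 2.0, 3.0, 4.0]
perfect = coefficient_of_determination(actual, [1.0, 2.0, 3.0, 4.0])
mean_only = coefficient_of_determination(actual, [2.5, 2.5, 2.5, 2.5])
reversed_forecast = coefficient_of_determination(actual, [4.0, 3.0, 2.0, 1.0])
```

A perfect forecast scores 1, forecasting the mean scores 0, and the reversed forecast scores -3, which shows why the value has no lower bound.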


Empirical Evaluation – K-Nearest-Neighbor

• Choosing window size analogous to choosing number of neural network inputs
• For sawtooth data series:
  – k = 2
  – Test window sizes of 20, 24, and 30
• For sunspots data series:
  – k = 3
  – Window size of 10
• Compare forecasts via coefficient of determination


Empirical Evaluation – Candidate Selection

• Neural networks
  – For each training method, data series, and architecture, 3 candidates were trained
  – Also, average of 3 candidates’ forecasts was taken: forecasting by committee
  – Best forecast was selected based on coefficient of determination
• K-nearest-neighbor
  – For each data series, k, and window size, only one search was performed (only one needed)
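Forecasting by committee and the selection of the best forecast can be sketched as follows. The function names and the three candidate forecasts are made up for illustration; they are not results from the thesis.

```python
def r_squared(actual, forecast):
    """Coefficient of determination of a forecast against the actual values."""
    mean = sum(actual) / len(actual)
    residual = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    total = sum((a - mean) ** 2 for a in actual)
    return 1.0 - residual / total

def committee_forecast(candidates):
    """Average the candidates' forecasts point by point."""
    return [sum(values) / len(values) for values in zip(*candidates)]

def best_forecast(actual, forecasts):
    """Return the forecast with the highest coefficient of determination."""
    return max(forecasts, key=lambda f: r_squared(actual, f))

actual = [1.0, 2.0, 3.0]
candidates = [[1.0, 2.0, 4.0], [3.0, 2.0, 1.0], [2.0, 2.0, 1.0]]
committee = committee_forecast(candidates)
best = best_forecast(actual, candidates + [committee])
```

In this made-up case the committee averages out to the mean of the actual values (coefficient 0), while the first candidate scores 0.5 and is selected. Note that the selection needs the actual values.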

Empirical Evaluation – Original Data Series

[Figures: forecasts on the original data series. Panels: Heuristic NN, Simple NN, Smaller NN, K-N-N. Chart titles: “Nets Trained on Original” (series: Original, 35,2, 35,10, 35,20; a second chart with series Original, 25,10, 25,20) and “K-Nearest-Neighbor on Original” (series: Original, 2,20, 2,24, 2,30). Axes: Data Point (216 to 285), Value.]

Empirical Evaluation – Less Noisy Data Series

[Figures: forecasts on the less noisy data series. Chart titles: “Nets Trained on Less Noisy” (two charts; series: Original, 35,2, 35,10, 35,20) and “K-Nearest-Neighbor on Less Noisy” (panel K-N-N; series: Original, 2,20, 2,24, 2,30). Axes: Data Point (216 to 285), Value.]

Empirical Evaluation – More Noisy Data Series

[Figures: forecasts on the more noisy data series. Chart titles: “Nets Trained on More Noisy” (two charts) and “K-Nearest-Neighbor on More Noisy” (panel K-N-N; series: Original, 2,20, 2,24, 2,30). Axes: Data Point (216 to 285), Value.]

Empirical Evaluation – Ascending Data Series

[Figures: forecasts on the ascending data series. Panels: Simple NN, Heuristic NN. Chart title: “Nets Trained on Ascending” (two charts; series: Ascending, 35,10, 35,20 and Ascending, 35,2, 35,10, 35,20). Axes: Data Point (216 to 285), Value.]

Empirical Evaluation – Longer Forecast

[Figures: longer forecasts, panel Heuristic NN. Chart titles: “Nets Trained on Less Noisy (Longer Forecast)” (series: Original, 35,2, 35,10, 35,20) and “Nets Trained on More Noisy (Longer Forecast)”. Axes: Data Point (216 to 356), Value.]

Empirical Evaluation – Sunspots Data Series

[Figure: “Sunspots 1950-1983”, panel Simple NN & K-N-N. Series: Test Set, 30,30 Neural Net, 3,10 K-Nearest-Neighbor. Axes: Year (1950 to 1982), Count.]


Empirical Evaluation – Discussion

• Heuristic training method observations:

– Networks train longer (more epochs) on smoother data series like the original and ascending data series

– The total squared error and unscaled error are higher for noisy data series

– Neither the number of epochs nor the errors appear to correlate well with the coefficient of determination

– In most cases, the committee forecast is worse than the best candidate's forecast

• When actual values are unavailable, choosing the best candidate is difficult!



• Simple training method observations:

– The total squared error and unscaled error are higher for noisy data series with the exception of the 35:10:1 network trained on the more noisy data series

– The errors do not appear to correlate well with the coefficient of determination

– In most cases, the committee forecast is worse than the best candidate's forecast

– There are four networks whose coefficient of determination is negative, compared with two for the heuristic training method

[Figures: two “Coefficient of Determination Comparison” charts. X-axis: Data Series (Original, Less Noisy, More Noisy, Ascending). Y-axis: Coefficient of Determination (-0.5 to 1). Series: 35,2, 35,10, 35,20.]



• General observations:
  – Neither training method appeared to be clearly better
  – Increasingly noisy data series increasingly degraded the forecasting performance
  – Nonstationarity in the mean degraded the performance
  – Too few hidden units (e.g., 35:2:1) forecasted well on simpler data series, but failed for more complex ones
  – Excessive numbers of hidden units (e.g., 35:20:1) did not hurt performance
  – Twenty-five network inputs were not sufficient
  – K-nearest-neighbor was consistently better than the neural networks
  – Feed-forward neural networks are extremely sensitive to architecture and parameter choices, and making such choices is currently more art than science, more trial-and-error than absolute, more practice than theory!


Data Preprocessing

• First-difference
  – For ascending data series, a neural network trained on first-difference can forecast near perfectly
  – In that case, it is better to train and forecast on first-difference
  – FORECASTER reconstitutes forecast from its first-difference
• Moving average
  – For noisy data series, moving average would eliminate much of the noise
  – But would also smooth out peaks and valleys
  – Series may then be easier to learn and forecast
  – But in some series, the “noise” may be important data (e.g., utility load forecasting)
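Both preprocessing steps, and the reconstitution of a forecast made on first-differences, can be sketched as below. The function names are hypothetical and the moving average shown is a plain trailing-window mean, which is an assumption; FORECASTER's exact variant is not given in the slides.

```python
def first_difference(series):
    """Replace each value with its change from the previous value."""
    return [b - a for a, b in zip(series, series[1:])]

def reconstitute(last_value, diff_forecast):
    """Turn forecasted differences back into values, starting from the
    last known value of the original series."""
    values = []
    current = last_value
    for d in diff_forecast:
        current += d
        values.append(current)
    return values

def moving_average(series, window):
    """Mean of each run of `window` consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

ascending = [1, 3, 6, 10]
diffs = first_difference(ascending)      # removes the nonstationary mean
restored = reconstitute(10, [5, 6])      # forecasted differences 5 and 6
smoothed = moving_average([1, 2, 3, 4, 5], 3)
```

The differenced series is one value shorter than the original, and the moving average is shorter by `window - 1` values, so both leave slightly less data to partition.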


Contributions

• Filled a void within feed-forward neural network time series forecasting literature: how networks respond to various data series characteristics in a controlled environment

• Showed that k-nearest-neighbor is a better forecasting method for the data series used in this research

• Reaffirmed that neural networks are very sensitive to architecture, parameter, and learning method changes

• Presented some insight into neural network architecture selection: selecting number of network inputs based on data series

• Presented a neural network training heuristic that produced good results


Future Work

• Upgrade FORECASTER to work with classification problems
• Add more complex network types, including wavelet networks for time series forecasting
• Investigate k-nearest-neighbor further
• Add other forecasting methods (e.g., decision trees for classification)


Conclusion

• Presented:
  – Time series forecasting
  – Neural networks
  – K-nearest-neighbor
  – Empirical evaluation
• Learned a lot about the implementation details of the forecasting techniques
• Learned a lot about MFC programming


Demonstration

Various files can be found at: http://w3.uwyo.edu/~eplummer

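Unit Output, Error, and Weight Change Formulas

Hidden-layer unit output:

$$O_c = h_{Hidden}\left(\sum_{p=1}^{P} i_{c,p}\, w_{c,p} + b_c\right), \quad \text{where } h_{Hidden}(x) = \frac{1}{1 + e^{-x}}$$

Output-layer unit output:

$$O_c = h_{Output}\left(\sum_{p=1}^{P} i_{c,p}\, w_{c,p} + b_c\right), \quad \text{where } h_{Output}(x) = x$$

Output-layer unit error:

$$\delta_c = h'_{Output}(x)\,(D_c - O_c)$$

Hidden-layer unit error:

$$\delta_c = h'_{Hidden}(x) \sum_{n=1}^{N} \delta_n\, w_{n,c}$$

Weight change:

$$\Delta w_{c,p} = \alpha\, \delta_c\, O_p$$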

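Forecast Error Formulas

Total squared error:

$$E_C = \frac{1}{2} \sum_{c=1}^{C} (D_c - O_c)^2$$

Unscaled error:

$$UE_C = \sum_{c=1}^{C} \left| UD_c - UO_c \right|$$

Coefficient of determination:

$$r^2 = 1 - \frac{\sum_{i=1}^{n} (x_i - \hat{x}_i)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

$$r^2 = \begin{cases} 1 & \text{if } \hat{x}_i = x_i \ \forall i \\ 0 < k < 1 & \text{if } \hat{x}_i \text{ is a better forecast than } \bar{x} \\ 0 & \text{if generally } \hat{x}_i = \bar{x} \\ k < 0 & \text{if } \hat{x}_i \text{ is a worse forecast than } \bar{x} \end{cases}$$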


Related Work

• Drossu and Obradovic (1996): hybrid stochastic and neural network approach to time series forecasting

• Zhang and Thearling (1994): parallel implementations of neural networks and memory-based reasoning

• Geva (1998): multiscale fast wavelet transform and an array of feed-forward neural networks

• Lawrence, Tsoi, and Giles (1996): encodes the series with a self-organizing map and uses recurrent neural networks

• Kingdon (1997): automated intelligent system for financial forecasting that uses neural networks and genetic algorithms
