1
Testing neural networks using the complete round
robin method
Boris Kovalerchuk, Clayton Todd, Dan Henderson
Dept. of Computer Science, Central Washington University, Ellensburg, WA, 98926-7520
borisk@tahoma.cwu.edu toddcl@cwu.edu danno@elltel.net
2
Problem
Common practice: Patterns discovered by learning methods such
as NN are tested on the data before they are used for their intended purpose.
Reliability of testing: The conclusion about the reliability of discovered
patterns depends on the testing methods used.
Problem: A testing method may not test critical
situations.
3
Result: a new testing method
Tests a much wider set of situations than the traditional round robin method
Speeds up the required computations
4
Novelty and Implementation
Novelty: The mathematical mechanism is based on the
theory of monotone Boolean functions and multithreaded parallel processing.
Implementation: The method has been implemented for
backpropagation neural networks and successfully tested on 1024 neural networks using SP500 data.
5
1. Approach and Method
Common approach:
1. Select subsets in the data set D: subset Tr for training and subset Tv for validating the discovered patterns.
2. Repeat step 1 several times for different subsets.
3. Compare: if the results are similar to each other, then a discovered regularity can be called reliable for data D.
[Diagram: training subset Tr and validation subset Tv within data set D]
6
Known Methods[Dietterich, 1997]
Random selection of subsets -- bootstrap aggregation (bagging)
Selection of disjoint subsets -- cross-validated committees
Selection of subsets according to a probability distribution -- boosting
7
Problems of sub-sampling
Different regularities are found for different sub-samples of Tr.
Rejecting and accepting regularities heavily depends on the specific splitting
of Tr. Example:
a non-stationary financial time series, with bear and bull market trends in different
time intervals.
8
Splitting sensitive regularities for non-stationary data
[Diagram: training sample Tr consists of parts A and B; C is independent data.
Regularity 1 (bear market) is learned on A; regularity 2 (bull market) is learned on B.
Reg2 does not work on B' = A; Reg1 does not work on A' = B.]
9
Round robin method
Eliminates arbitrary splitting by examining several groups of subsets of Tr.
The complete round robin method examines all groups of subsets of Tr.
Drawback: 2^n possible subsets, where n is the number of
groups of objects in the data set. Learning 2^n neural networks is a
computational challenge.
10
Implementation
The complete round robin method is applicable to both attribute-based and relational data mining methods.
The method is illustrated here for neural networks.
11
Data
Let M be a learning method and D a data set of N objects
represented by m attributes: D = {di}, i = 1,…,N, di = (di1, di2,…, dim). Method M is applied to data D for
knowledge discovery.
12
Grouping data
Example -- a stock time series. The first 250 data objects (days) belong to
1980, the next 250 objects (days) belong to 1981,
and so on.
Similarly, half-years, quarters and other time intervals can be used.
Any of these subsets can be used as training data.
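The yearly grouping above can be sketched as follows; the 250-trading-day year length is taken from the slide, and the function name is our own illustration, not the paper's code:

```python
# Sketch: partition a daily stock series into contiguous "years" of
# 250 trading days each, as described in the slide.
def group_by_year(series, days_per_year=250):
    return [series[i:i + days_per_year]
            for i in range(0, len(series), days_per_year)]

prices = list(range(1000))   # four hypothetical "years" of daily closes
groups = group_by_year(prices)
```

Each element of `groups` then plays the role of one group of objects in the round robin subsets.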
13
Testing data
Example 1 (general data): Training -- the odd groups #1, #3, #5, #7 and
#9; Testing -- #2, #4, #6, #8 and #10.
Example 2 (time series): Training -- #1, #2, #3, #4 and #5; Testing -- #6, #7, #8, #9 and #10.
A third path: use completely independent later data C for
testing.
14
Hypothesis of monotonicity
Let D1 and D2 be training data sets and Perform1 and Perform2
the performance indicators of the models learned from D1 and D2.
Perform is a binary index: 1 stands for appropriate performance and 0 stands for inappropriate performance.
The hypothesis of monotonicity (HM): If D1 ⊇ D2 then Perform1 ≥ Perform2. (1)
15
Assumption
If data set D1 covers data set D2, then the performance of method M on D1 should be better than or equal to the performance of M on D2.
Extra data bring more useful information than noise for knowledge discovery.
16
Experimental testing of the hypothesis
The hypothesis P(M, Di) ≥ P(M, Dj) for performance is not always true.
We generated the 1024 subsets of ten years and the corresponding NNs.
We found 683 pairs of subsets such that Di ⊇ Dj. Surprisingly, in this experiment
monotonicity was observed for all 683 of these combinations of years.
17
Discussion
The experiment gives strong evidence for using monotonicity
along with the complete round robin method.
The incomplete round robin method assumes some kind of independence of the training subsets used.
The discovered monotonicity shows that this independence should at least be tested.
18
Formal notation
The error Er for data set D = {di}, i = 1,…,N, is the normalized error over all its components di:

Er = Σ_{i=1..N} (T(di) − J(di))² / Σ_{i=1..N} T(di)²

T(di) is the actual target value for di, and J(di) is the target value forecast delivered
by the discovered model J, i.e., the trained neural network in our case.
19
Performance
Performance is measured by the error tolerance (threshold) Q0 on the error Er:

Perform = 1, if Er ≤ Q0
Perform = 0, if Er > Q0
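A minimal sketch of Er and Perform under the definitions above; the function names are ours, and Er is taken as the normalized squared error sum((T − J)²) / sum(T²):

```python
# Sketch: normalized error Er and the binary Perform indicator.
# q0 is the error tolerance threshold Q0 (e.g. 0.03 for 3%).
def normalized_error(targets, forecasts):
    num = sum((t - j) ** 2 for t, j in zip(targets, forecasts))
    den = sum(t ** 2 for t in targets)
    return num / den

def perform(targets, forecasts, q0=0.03):
    return 1 if normalized_error(targets, forecasts) <= q0 else 0
```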
20
Binary hypothesis of monotonicity
Combinations of years are coded as binary vectors vi = (vi1, vi2, …, vi10), like
0000011111, with 10 components ranging from (0000000000) to (1111111111):
in total 2^10 = 1024 data subsets. Hypothesis of monotonicity:
If vi ≥ vj then Performi ≥ Performj (2)
Here vi ≥ vj ⇔ vik ≥ vjk for all k = 1,…,10.
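The binary coding of year subsets and the component-wise "≥" comparison can be sketched as follows (helper names are ours):

```python
# Sketch: code subsets of the ten years 1980-1989 as 10-component
# binary vectors and compare them component-wise, as in (2).
YEARS = list(range(1980, 1990))

def years_from_vector(v):
    # e.g. (0,0,0,0,0,1,1,1,1,1) selects 1985-1989
    return [y for y, bit in zip(YEARS, v) if bit]

def leq(u, v):
    # u <= v component-wise: u's training data is contained in v's
    return all(a <= b for a, b in zip(u, v))
```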
21
Monotone Boolean Functions
Not every vi and vj are comparable with each other by the "≥" relation.
Perform as a quality indicator Q: Q(M, D, Q0) = 1 ⇔ Perform = 1, (3) where Q0 is some performance limit.
Rewritten monotonicity (2): If vi ≥ vj then Q(M, Di, Q0) ≥ Q(M, Dj, Q0). (4)
Q(M, D, Q0) is a monotone Boolean function of D. [Hansel, 1966; Kovalerchuk et al, 1996]
22
Use of Monotonicity
Given a method M and data sets D1 and D2 with D2 ⊆ D1, monotonicity says:
if the method M does not perform well on the data D1, then it will not perform well on the data D2 either.
Under this assumption, we do not need to test method M on D2.
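This pruning rule can be sketched as follows. It is a simplified stand-in for the paper's Hansel-chain bookkeeping, and the names are ours:

```python
# Sketch: infer Perform values implied by monotonicity instead of
# retraining. If Perform=1 on some data set, every superset also gets 1;
# if Perform=0 on some data set, every subset also gets 0.
def leq(u, v):
    return all(a <= b for a, b in zip(u, v))

def infer(vector, known_ones, known_zeros):
    """Return 1 or 0 if implied by earlier results, else None (must train)."""
    if any(leq(w, vector) for w in known_ones):
        return 1   # superset of a data set that already performed well
    if any(leq(vector, w) for w in known_zeros):
        return 0   # subset of a data set that already failed
    return None
```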
24
Experiments with SP500 and Neural Networks
The backpropagation neural network for predicting SP500 [Rao, Rao, 1993] achieved a
0.6% prediction error on the SP500 test data (50 weeks) with 200 weeks (about four years) of training data.
Is this result reliable? Will it be sustained for wider training and testing data?
25
Data for testing reliability
The same data:
Training data -- all trading weeks from 1980 to 1989 and
Independent testing data -- all trading weeks from 1990-1992.
26
Complete round robin
Generated all 1024 subsets of the ten training years (1980-1989).
Computed
the 1024 corresponding backpropagation neural networks and
their Perform values. Used a 3% error tolerance threshold (Q0 = 0.03), higher than the 0.6% used for the smaller data set in [Rao, Rao, 1993].
27
Performance
Table 1. Performance of 1024 neural networks

Training  Testing  Number of neural networks  % of neural networks
0         0        289                        28.25
0         1        24                         2.35
1         0        24                         2.35
1         1        686                        67.05
A satisfactory performance is coded as 1 and non-satisfactory performance is coded as 0.
28
Analysis of performance
Consistent performance (training and testing): 67.05% + 28.25% = 95.3% with 3% error tolerance.
Sound performance on training data =>
sound performance on testing data (67.05%).
Unsound performance on training data =>
unsound performance on testing data (28.25%).
A random choice of data for training from the ten-year SP500 data will not produce a regularity in 32.95% of cases, although regularities useful for forecasting do exist.
29
Specific Analysis
Table 2 shows 9 nested data subsets of the 1024 possible
subsets. Begin the nested sequence with a
single year (1986). This single year's data is too small to
train a neural network to a 3% error tolerance.
30
Specific analysis
Table 2. Backpropagation neural network performance for different data subsets.

Binary code (years 80-89)  Training years included  Training  Testing (90-92)
0000001000                 86                        0         0
0000001001                 86, 89                    0         0
0000001011                 86, 88, 89                0         0
0000011011                 85, 86, 88, 89            1         1
0000111011                 84-86, 88, 89             1         1
0001111011                 83-86, 88, 89             1         1
0011111011                 82-86, 88, 89             1         1
0111111011                 81-86, 88, 89             1         1
1111111011                 80-86, 88, 89             1         1

9 nested data subsets of the 1024 possible subsets
31
Nested sets
A single year (1986) is too small to train a neural network to a 3%
error tolerance. 1986, 1988 and 1989
produce the same negative result. With 1986, 1988, 1989 and 1985,
the error moved below the 3% threshold. Five or more years of data
also satisfy the error criterion. The monotonicity hypothesis is confirmed.
32
Analysis
Among all other 1023 combinations of years: only a few combinations of four years
satisfy the 3% error tolerance; practically all five-year combinations
satisfy the 3% threshold; all combinations of more than five years
satisfy this threshold.
33
Analysis
Four-year training data sets produce marginally reliable forecasts. Example:
1980, 1981, 1986, and 1987, corresponding to the binary vector (1100001100), do not satisfy the 3% error tolerance.
34
Further analyses
Three levels of error tolerance Q0 were used: 3.5%, 3.0% and 2.0%.
The number of neural networks with sound performance goes down from
81.82% to 11.24% as the error tolerance moves from 3.5% to 2%.
NNs with a 3.5% error tolerance are much more reliable than networks with a 2.0%
error tolerance.
35
Three levels of error tolerance
Table 3. Overall performance of 1024 neural networks with different error tolerance

Error tolerance  Training  Testing  Number of neural networks  % of neural networks
3.5%             0         0        167                        16.32
                 0         1        3                          0.29
                 1         0        6                          0.59
                 1         1        837                        81.81
3.0%             0         0        289                        28.25
                 0         1        24                         2.35
                 1         0        24                         2.35
                 1         1        686                        67.05
2.0%             0         0        845                        82.60
                 0         1        22                         2.15
                 1         0        31                         3.03
                 1         1        115                        11.24
36
Performance
[Bar chart: % of neural networks by performance <training, testing> in {<0,0>, <1,1>, <0,1>, <1,0>}, for error tolerances 3.5%, 3.0% and 2.0%.]
37
Standard random choice method
A random choice of training data with a 2% error tolerance will more often reject
the training data as insufficient.
This standard approach does not even let us know how unreliable the result is without running all 1024 subsets of the complete round robin method.
38
Reliability of 0.6% error tolerance
Rao and Rao [1993] obtained the 0.6% error for 200 weeks (about
four years) out of the 10-year training data.
This result is unreliable (Table 3).
39
Underlying mechanism
The number of computations depends on the sequence of testing data subsets Di.
To optimize the testing sequence, Hansel's lemma [Hansel, 1966; Kovalerchuk et al, 1996] from the theory of monotone Boolean functions is applied to so-called Hansel chains of binary vectors.
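One standard way to build Hansel chains is the recursive doubling construction sketched below; this is our illustration of the general technique, not a reproduction of the paper's code:

```python
# Sketch: Hansel's recursive construction partitions {0,1}^n into
# increasing chains; evaluating a monotone Boolean function chain by
# chain requires far fewer queries than testing all 2^n vectors.
def hansel_chains(n):
    chains = [[(0,), (1,)]]              # the single chain in {0,1}^1
    for _ in range(n - 1):
        new = []
        for c in chains:
            # grow: append 0 to every vector, then the top vector with 1
            new.append([v + (0,) for v in c] + [c[-1] + (1,)])
            # shrink: append 1 to all but the top vector
            if len(c) > 1:
                new.append([v + (1,) for v in c[:-1]])
        chains = new
    return chains

chains = hansel_chains(10)   # covers all 1024 year-subset vectors
```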
40
Computational challenge of the complete round robin method
Monotonicity and multithreading significantly speed up computing.
Using monotonicity with 1023
threads decreased the average runtime about 3.5 times, from 15-20 minutes to 4-6 minutes, to train 1023 neural networks in the case of mixed 1s and 0s as output.
41
Error tolerance and computing time
Different error tolerance values can change the output and runtime; consider the extreme cases with all 1's or all 0's as
outputs. File preparation is needed to train 1023 neural
networks. The largest share of file
preparation time (41.5%) is taken by files using five years in the data subset (Table 5).
42
Runtime
Table 4. Runtime for different error tolerance settings

Method                                             1 processor, no threads   1 processor, 1023 threads
Round robin with monotonicity,                     15-20 min.                4-6 min.
mixed 1's and 0's as output
Round robin with monotonicity,                     10 min.                   3.5 min.
all 1's as output

(Average times for training 1023 neural networks.)
43
Time distribution
Table 5. Time for backpropagation and file preparation

Set of years  % of time in file preparation  % of time in backpropagation
0000011111    41.5%                          58.5%
0000000001    36.4%                          63.6%
1111111111    17.8%                          82.2%
44
General logic of software
The exhaustive option: generate all 1024 subsets using a file
preparation program and compute backpropagation for all subsets.
The optimized option: generate Hansel chains, store them, and compute backpropagation only for
the specific subsets dictated by the chains.
45
Implemented optimized option
1. For a given binary vector the corresponding training data are produced.
2. Backpropagation is computed generating the Perform value.
3. The Perform value is used along with the stored Hansel chains to decide which binary vector (i.e., subset of data) will be used next for learning neural networks.
46
System Implementation
The exhaustive set of Hansel chains is generated first and stored for the desired case.
All 1024 subsets of data are created using a file preparation program.
Based on the sequence dictated by the Hansel chains, we produce the corresponding training data and compute backpropagation to generate the Perform value. The next vector to be used is based on the stored Hansel chains and the Perform value.
47
User Interface and Implementation
Built to take advantage of Windows NT threading.
Built with Borland C++ Builder 4.0. Parameters for the neural networks
and sub-systems are set up through the interface.
Various options, such as monotonicity, can be turned on or off through the interface.
48
Multithreaded Implementation
Independent sub-processes: learning of an individual neural network
or a group of neural networks. Each sub-process is implemented as an
individual thread. Threads run in parallel on several processors to
further speed up computations.
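The thread-per-network design above can be sketched with a thread pool; `train_and_score` is a hypothetical stand-in for the backpropagation step, not the paper's implementation:

```python
# Sketch: evaluate many year-subset networks in parallel, mirroring the
# one-thread-per-network design. train_and_score is a placeholder: real
# code would train a backpropagation network on the coded subset.
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def train_and_score(vector):
    # toy stand-in: "enough years of data" -> Perform = 1
    return 1 if sum(vector) >= 5 else 0

vectors = list(product([0, 1], repeat=10))          # all 1024 subsets
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(zip(vectors, pool.map(train_and_score, vectors)))
```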
49
Threads
Two different types of threads: worker threads and communication threads.
Communication threads send and receive the data.
Worker threads are started by the servers to perform the calculations on the data.
50
Client/Server Implementation
Decreases the workload per computer: the client does the data formatting, the server
performs all the data manipulation. A client connects to multiple servers. Each server performs work on the data
and returns the results to the client. The client formats the returned results
and gives a visualization of them.
51
Client
Loads vectors into a linked list. Attaches a thread to each node. Connects each thread to an active server. Threads send the server an activation packet. Waits for the server to return results. Formats the results for output. Sends more work to the server.
52
Server
Obtains a message from the client that the client is connected.
Starts work on the data given to it by the client.
Returns the finished packet to the client. Loops.