Ensemble Based Classification and Forecasting...
Ensemble Based Classification and
Forecasting Methods
Dr P. N. Suganthan, School of EEE, NTU, Singapore
Some Software Resources Available from:
http://www.ntu.edu.sg/home/epnsugan
TENCON 2016
MBS Singapore, 22nd Nov. 2016
General Concept
[Diagram: the training data S is resampled into multiple data sets S1, S2, …, Sn; each data set trains a predictor P1, P2, …, Pn; the predictors' outputs are combined into a single decision H.]
Y. Ren, L. Zhang, and P. N. Suganthan, "Ensemble Classification and Regression – Recent Developments, Applications and Future Directions," IEEE Computational Intelligence Magazine, DOI: 10.1109/MCI.2015.2471235, Feb 2016.
Outline
• Classification
– Random forests
– Random Vector Functional Link
– Kernel Ridge Regression
– Benchmarking
• Forecasting
– Random vector functional link networks
– Random forests
Non-iterative methods
Bagging (1)
Sampling from the original training data with replacement (the original data set is not depleted), usually drawing the same number of samples for each bag as in the original dataset. In expectation a bag contains about 63% distinct samples and 37% duplicates, since the fraction of distinct samples is 1 − (1 − 1/n)^n ≈ 1 − 1/e ≈ 0.632.
[Diagram: the original training data is resampled into several bags, each with the same number of samples as the original dataset.]
Bagging (2)
1. For i = 1 … n_trees (the number of trees, i.e. bags):
   a) Draw samples (with replacement) from the training set to generate the training data Si.
   b) Learn a classifier Ci on Si.
2. For each test example:
   a) Apply all classifiers Ci.
   b) Predict the class that receives the highest number of votes.
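The two steps above can be sketched in a few lines of Python (an illustrative sketch, not code from the talk; `clf` stands for any trained base classifier treated as a callable):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Step 1a: draw n samples with replacement from a dataset of size n."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def bagging_predict(classifiers, x):
    """Step 2: majority vote over the ensemble's predictions for one example."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# The distinct-sample fraction quoted on the previous slide: in expectation a
# bag contains 1 - (1 - 1/n)^n -> 1 - 1/e ~ 63.2% distinct samples.
rng = random.Random(0)
data = list(range(10_000))
bag = bootstrap_sample(data, rng)
print(len(set(bag)) / len(data))  # close to 0.632
```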
Random Forests
L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
Patented, with restrictions on commercial usage.
Decision Tree: An Example
Classifying an email as either spam or a valid email using features such as ch!, ch$, 1999, george, free, our, receive, …
Each node is labeled based on the majority of the training samples reaching the node; a node that is not split further is a terminal node. The training samples of each class are indicated as x/y below each node.
A node of a decision tree
Assume that n training samples (each with M features) reach a node. They form an n × M matrix: each row is a data point and each column is a feature. At the node, d of the M features are selected randomly, and a threshold must be chosen for each selected feature. How to optimize the threshold for each feature?
Optimize the threshold for each feature
• Using Gini impurity, the threshold should be optimized so that the decrease Gini_{beforesplit} - Gini_{aftersplit} is maximized:

Gini_{beforesplit}(t) = 1 - \sum_{i=1}^{C} \left( \frac{n_{w_i}}{n} \right)^2

Gini_{aftersplit}(t) = \frac{n_l}{n} \left[ 1 - \sum_{i=1}^{C} \left( \frac{n^l_{w_i}}{n_l} \right)^2 \right] + \frac{n_r}{n} \left[ 1 - \sum_{i=1}^{C} \left( \frac{n^r_{w_i}}{n_r} \right)^2 \right]

C – total number of classes at the node; n – total samples; l – left branch; r – right branch; n_{w_i} – samples of class w_i.
There are other impurity criteria such as information gain, etc.
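The impurity computation and the threshold scan can be sketched directly (illustrative Python, not from the talk):

```python
def gini(counts):
    """Gini impurity 1 - sum_i (n_wi / n)^2 from per-class counts at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_decrease(left, right):
    """Impurity decrease Gini_beforesplit - Gini_aftersplit for one split."""
    n_l, n_r = sum(left), sum(right)
    n = n_l + n_r
    before = gini([l + r for l, r in zip(left, right)])
    after = (n_l / n) * gini(left) + (n_r / n) * gini(right)
    return before - after

def best_threshold(values, labels, classes=(0, 1)):
    """Scan candidate thresholds for one feature, keeping the best decrease."""
    best_t, best_d = None, -1.0
    for t in sorted(set(values)):
        left = [sum(v <= t and y == c for v, y in zip(values, labels)) for c in classes]
        right = [sum(v > t and y == c for v, y in zip(values, labels)) for c in classes]
        if sum(left) and sum(right):
            d = gini_decrease(left, right)
            if d > best_d:
                best_t, best_d = t, d
    return best_t, best_d

print(best_threshold([1, 2, 3, 4], [0, 0, 1, 1]))  # (2, 0.5): a pure split
```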
Rotation Forest
The training data form an N × M matrix, where M is the dimensionality of each data point.
• A distinct rotation matrix for each tree
• All features and all samples are rotated at the root level.
• In the default version, all rotated features are used in each node.
L. Zhang, P. N. Suganthan, "Random Forests with Ensemble of Feature Spaces," Pattern Recognition, 47 (10), 3429-3437, 2014. (Codes Available: 2015-TCyb-Oblique-RF)
Rotation Forest Illustrated
[Figure]
Rotation Forest (2)
• Base classifiers: decision trees
• PCA is a simple rotation of the coordinate axes
Method for Constructing the Rotation Matrix
X: the objects in the training data set; x = [x_1, x_2, …, x_M]^T is a data point with M features.

X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1M} \\ \vdots & & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NM} \end{pmatrix} \quad (\text{an } N \times M \text{ matrix})

Y = [y_1, y_2, …, y_N]^T: class labels with c classes.
Method for Constructing the Rotation Matrix
• For i = 1 … L (to construct the training set for classifier D_i):
– Split the full feature set F into K random subsets F_{i,j}, j = 1 … K, each with m = M/K features.
• For j = 1 … K (illustrated for i = 1):
– X_{1,j}: the data set X restricted to the features in F_{1,j}.
– Eliminate a random subset of the data.
– Select a bootstrap sample from X_{1,j} to obtain X'_{1,j}.
– Run PCA on X'_{1,j} using only its m features (default m = 3).
– This yields the principal components a^{(1)}_{1,j}, …, a^{(m_j)}_{1,j}.
Method … (Cont'd)
• Arrange the principal components for all j to obtain the block-diagonal rotation matrix

R_1 = \begin{pmatrix} a^{(1)}_{1,1}, \ldots, a^{(M_1)}_{1,1} & [0] & \cdots & [0] \\ [0] & a^{(1)}_{1,2}, \ldots, a^{(M_2)}_{1,2} & \cdots & [0] \\ \vdots & & \ddots & \vdots \\ [0] & [0] & \cdots & a^{(1)}_{1,K}, \ldots, a^{(M_K)}_{1,K} \end{pmatrix}

• Rearrange the rows of R_1 so as to match the order of features in F, obtaining R_1^a.
• Build classifier D_1 using X R_1^a as its training set.
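The per-tree construction just described can be sketched as follows (a simplified illustration: the elimination of a random data subset is reduced here to a plain bootstrap of rows, features are split evenly, and none of this is the authors' released code):

```python
import numpy as np

def rotation_matrix(X, K, rng):
    """Build one tree's rotation matrix: PCA per random feature subset,
    loadings placed block-wise, rows aligned with the original feature order."""
    N, M = X.shape
    subsets = np.array_split(rng.permutation(M), K)
    R = np.zeros((M, M))
    for feats in subsets:
        rows = rng.choice(N, size=int(0.75 * N), replace=True)  # bootstrap sample
        Xs = X[np.ix_(rows, feats)]
        Xs = Xs - Xs.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xs, full_matrices=False)  # principal axes
        # indexing by `feats` both places the block and rearranges the rows
        R[np.ix_(feats, feats)] = Vt.T
    return R

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
R = rotation_matrix(X, K=2, rng=rng)
X_rot = X @ R  # training set for this tree
print(R.shape)  # (6, 6)
```

Since every block is orthonormal, the assembled R is itself orthogonal, i.e. a pure rotation of the feature space.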
Hyperplane in a node of a Decision Tree
• In most cases, an axis-parallel hyperplane is employed in the decision tree.
• An oblique hyperplane can be better.
• Axis-parallel = univariate = orthogonal
• Oblique = multivariate
Oblique vs axis-parallel hyperplane
[Figure: an example instance space in the (x, y) plane, where "+" and "." mark the two classes. The staircase formed by the univariate tests X>2, Y>4, X>5, Y>2 (with yes/no branches) is the decision boundary of the univariate DT, while a single straight line is the decision boundary of the oblique DT.]
A Toy Example of Oblique and Axis-Parallel RF
Oblique Decision Tree (DT) Ensemble via Multisurface Proximal SVM (MP-SVM)
• Univariate (axis-parallel) DT: finding a split amounts to finding the feature that is most 'useful' in discriminating the input data.
• Multivariate (oblique) DT: finding a 'composite' feature, a combination of the existing features, that has good discriminatory power.
• Suppose there are n examples with d features. For a univariate DT there exist only d(n-1) candidate hyperplanes; the number of oblique hyperplanes, however, is O(n^d).
• Exhaustive search, which works quite well for univariate DTs, is impossible for multivariate DTs.
Le Zhang and P. N. Suganthan, "Oblique Decision Tree Ensemble via Multisurface Proximal Support Vector Machine," IEEE Transactions on Cybernetics, Oct 2015. 10.1109/TCYB.2014.2366468 (Codes available online)
How to find the optimal hyperplane in each internal node
• Heuristic search: hill climbing, simulated annealing, genetic algorithms.
• Time-consuming, and usually leads to sub-optimal solutions.
• For pattern classification, evolutionary methods are not effective.
SVM and its Variant: Multisurface Proximal SVM (MP-SVM)
SVM finds two parallel hyperplanes that divide the feature space into three disjoint parts. The data lying between these two hyperplanes are linearly inseparable; SVM classifies data by assigning them to one of the remaining two disjoint half-spaces.
MP-SVM instead aims to find two clustering hyperplanes: the first plane is closest to the class 1 data and farthest from class 2, and the second plane is closest to class 2 and farthest from class 1.
Obtain the optimal hyperplane from MP-SVM
• An example of how to obtain the optimal hyperplane from the two clustering hyperplanes. The two red planes are the clustering planes, and the two blue planes are their two angle bisectors. One bisector is chosen to act as the test: the data above the plane go to one child node and the rest go to the other child node. Since there are two bisectors, we choose the one with the better discriminant ability, which can be measured by a criterion such as information gain or Gini impurity.
MPSVM
• The first and the second clustering hyperplanes (with parameters W, b) are obtained by solving the two problems below, where A is the matrix of the first class (each row a data sample), B is the matrix of the second class, and e is a vector of ones with the same dimension as AW and BW:

\min_{(W,b) \neq 0} \frac{\|AW - e b\|^2}{\|BW - e b\|^2}, \qquad \min_{(W,b) \neq 0} \frac{\|BW - e b\|^2}{\|AW - e b\|^2}
MPSVM
• By defining G = [A  -e]^T [A  -e], H = [B  -e]^T [B  -e] and Z = [W; b], the two problems become:

\min_{Z \neq 0} \frac{Z^T G Z}{Z^T H Z}, \qquad \min_{Z \neq 0} \frac{Z^T H Z}{Z^T G Z}

• Thus, the two clustering hyperplanes are the eigenvectors corresponding to the smallest and the largest eigenvalues of the generalized eigenvalue problem:

G Z = \lambda H Z, \quad Z \neq 0
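A NumPy sketch of this generalized eigenvalue step (the small delta*I term keeps the solve well-posed and anticipates the regularization discussed later; both it and the toy data are assumptions of the sketch, not the paper's code):

```python
import numpy as np

def mpsvm_planes(A, B, delta=1e-6):
    """Clustering hyperplanes [W; b] from G z = lambda H z with
    G = [A -e]'[A -e] and H = [B -e]'[B -e]."""
    Ae = np.hstack([A, -np.ones((A.shape[0], 1))])
    Be = np.hstack([B, -np.ones((B.shape[0], 1))])
    G = Ae.T @ Ae + delta * np.eye(Ae.shape[1])
    H = Be.T @ Be + delta * np.eye(Be.shape[1])
    # G z = lambda H z  <=>  (H^-1 G) z = lambda z
    vals, vecs = np.linalg.eig(np.linalg.solve(H, G))
    order = np.argsort(vals.real)
    # smallest eigenvalue: plane close to class A, far from B; largest: reverse
    return vecs[:, order[0]].real, vecs[:, order[-1]].real

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2)) + np.array([3.0, 0.0])
B = rng.normal(size=(50, 2)) + np.array([-3.0, 0.0])
z_a, z_b = mpsvm_planes(A, B)
```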
MPSVM based decision tree
• How to handle the multiclass problem, since MPSVM can only solve binary classification problems?
• As the tree grows, the number of samples reaching the lower nodes decreases, but the matrices G and H always have size (M + 1) × (M + 1), so G and H may become singular.
• (One approach is to use a feature subspace, which both mitigates the singularity problem and reduces computation time.)
MP-SVM Based Decision Tree
The distance between two classes w_1 and w_2 is measured by the Bhattacharyya distance between their Gaussian models:

B(w_1, w_2) = \frac{1}{8} (\mu_2 - \mu_1)^T \left[ \frac{\Sigma_1 + \Sigma_2}{2} \right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln \frac{\left| (\Sigma_1 + \Sigma_2)/2 \right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}}
Multiclass problem
With C classes, compute the C × C distance matrix

D = \begin{pmatrix} 0 & d_{1,2} & \cdots & d_{1,C} \\ d_{2,1} & 0 & \cdots & d_{2,C} \\ \vdots & & \ddots & \vdots \\ d_{C,1} & d_{C,2} & \cdots & 0 \end{pmatrix}

The pair of classes with the largest distance seeds the two groups W_p and W_n. Each remaining class is then assigned to W_p if Distance 1 (to W_p) is smaller than Distance 2 (to W_n), and to W_n otherwise.
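Assuming the pairwise distance d_{i,j} is the Bhattacharyya distance between Gaussian class models, it can be computed as (illustrative sketch, not the paper's code):

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussian class models."""
    cov = (cov1 + cov2) / 2
    diff = mu2 - mu1
    mean_term = diff @ np.linalg.solve(cov, diff) / 8
    det_term = 0.5 * np.log(np.linalg.det(cov)
                            / np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return mean_term + det_term

# identical classes have distance 0; separating the means increases it
mu, cov = np.zeros(2), np.eye(2)
print(bhattacharyya(mu, cov, mu, cov))        # 0.0
print(bhattacharyya(mu, cov, mu + 2.0, cov))  # 1.0
```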
NULL space method
Solving the small sample size (S3) problem with the null space approach.

The objective is \min_{Z \neq 0} \frac{Z^T G Z}{Z^T H Z}. Suppose G has rank r. If G is singular, rewrite the problem as

\max_{Z \neq 0} \frac{Z^T H Z}{Z^T G Z}

Let Q = [x_{r+1}, \ldots, x_n] collect the eigenvectors spanning the null space of G, and project: H' = Q Q^T H Q Q^T, G' = Q Q^T G Q Q^T = 0. The problem becomes

\max_{Z \neq 0,\; Z^T G' Z = 0} Z^T H' Z

whose solution is the eigenvector of H' corresponding to the largest eigenvalue.
Other methods to handle the S3 problem
• The NULL space method [1,2] is very sensitive to data perturbations.
• Here, we apply two regularization approaches: Tikhonov regularization and axis-parallel split regularization.
[1] X. Jiang, "Linear subspace learning-based dimensionality reduction," IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 16–26, 2011.
[2] X. Jiang et al., "Eigenfeature regularization and extraction in face recognition," IEEE TPAMI, vol. 30, no. 3, pp. 383–394, 2008.
Other methods to handle S3
• Tikhonov regularization works by adding a constant term to the diagonal entries of the matrix to be regularized. In our case, if G becomes rank deficient, G is regularized as G' = G + \delta I.
Other methods to handle S3
• If the matrix G or H becomes singular at a node, we can always use an axis-parallel split to continue the tree induction process. The decision tree then grows using heterogeneous test functions: from the root node down to the current node it uses MPSVM splits, and from the current node down to the leaves it switches to axis-parallel splits.
Computational Complexity
• Here we make no assumption about the tree structure.
• Suppose there are m data samples with n features.
• For an axis-parallel split: m·n·log n (n·log n for each feature, where the Gini computation takes log n).
• For the MPSVM-based decision tree, the complexity of the generalized eigenvalue problem is of order m³.
• In the upper levels of the tree (near the root), MPSVM is much faster than the axis-parallel split.
• In the lower levels (near the leaves), the computational complexity is small (less data).
• MPSVM-T has slightly larger computational complexity than the standard DT; MPSVM-P has the same computational complexity as the standard DT.
T – Tikhonov regularization & P – axis-parallel regularization
Experimental results
Datasets
Le Zhang and P. N. Suganthan, “Oblique Decision Tree Ensemble via Multisurface Proximal Support Vector Machine,” IEEE Transactions on Cybernetics, Oct 2015. 10.1109/TCYB.2014.2366468 (Codes available online)
Experimental Results (1)
• Experimental settings:
1. Rotation Forest and Random Forest were kept at their default values in WEKA.
2. For Rotation Forest, M is fixed to 3.
3. All ensemble methods have the same ensemble size, 50.
4. Base DT classifier: CART (Breiman), Classification and Regression Tree.
5. Databases: UCI Machine Learning Repository, bioinformatics datasets, face recognition datasets.
6. 10 runs of 3-fold cross-validation.
Experimental Result: RaF
MPRaF-P: Multisurface Proximal Random Forest with axis-parallel regularization.
Experimental Result: RoF
MPRoF-T: Multisurface Proximal Rotation Forest
with Tikhonov regularization
Discussion
• RaF
• RoF

\min_{Z \neq 0} \frac{Z^T G Z}{Z^T H Z}

For RaF, H is (m+1) × (m+1). For the default RoF, H is (M+1) × (M+1), with m ≈ √M.
Why does MPSVM work for RaF?
Random Rotation Forest
• To solve this problem, we propose to employ a random feature subspace in the base learner of Rotation Forest, naming the result Random Rotation Forest (RRoF). In each node, the test function is evaluated on a randomly selected feature subset instead of the whole feature set.
• In this case, Random Forest and Rotation Forest differ in the way they perturb the data: Random Forest uses bagging to create a data subset, whereas Rotation Forest employs a different rotation matrix for each tree.
Experiment Result: RRoF
MPRRoF-N: Multisurface Proximal Random Rotation
Forest with Null space regularization
RaF vs RoF vs RRoF
• To assess the statistical significance of the results, we carry out a Friedman test. It ranks the algorithms on each data set separately, with the best-performing algorithm getting the lowest rank. Let r_i^j be the rank of the j-th of k algorithms on the i-th of N data sets. The Friedman test compares the average ranks of the algorithms, R_j = \frac{1}{N} \sum_i r_i^j. Under the null hypothesis, which states that all the algorithms are equivalent and so their ranks R_j should be equal, the Friedman statistic

\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_j R_j^2 - \frac{k(k+1)^2}{4} \right]

is distributed according to \chi^2 with k - 1 degrees of freedom when N and k are big enough. Iman and Davenport showed that Friedman's \chi_F^2 is undesirably conservative and derived a better statistic,

F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2}

which is distributed according to the F-distribution with k - 1 and (k - 1)(N - 1) degrees of freedom [1].
[1] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," JMLR, vol. 7, pp. 1–30, 2006.
RaF vs RoF vs RRoF
• If the null hypothesis is rejected, the Nemenyi test (Nemenyi, 1963) can be used to check whether the performance of two among the k classifiers is significantly different. If the corresponding average ranks differ by at least the critical difference

CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}

we say there is a significant difference between the two classifiers.
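The two statistics and the critical difference can be checked numerically (a sketch; N = 44 is inferred from the slide's F(2,86) degrees of freedom, and q_alpha = 2.343 is the Studentized-range constant for k = 3, alpha = 0.05 from Demšar's tables):

```python
import math

def friedman_stats(avg_ranks, N):
    """Friedman chi^2 and the Iman-Davenport F statistic from the average
    ranks of k algorithms over N data sets."""
    k = len(avg_ranks)
    chi2 = 12 * N / (k * (k + 1)) * (sum(r * r for r in avg_ranks)
                                     - k * (k + 1) ** 2 / 4)
    ff = (N - 1) * chi2 / (N * (k - 1) - chi2)
    return chi2, ff

def nemenyi_cd(k, N, q_alpha):
    """Nemenyi critical difference between average ranks."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * N))

# reproducing the numbers quoted on the next slide: ranks 2.24, 2.07, 1.69
chi2, ff = friedman_stats([2.24, 2.07, 1.69], N=44)
print(round(chi2, 2), round(ff, 2))        # 6.98 3.7
print(round(nemenyi_cd(3, 44, 2.343), 2))  # 0.5
```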
RaF vs RoF vs RRoF
• The average ranks for RaF, RoF and RRoF are 2.24, 2.07 and 1.69, respectively. Then χ²_F = 6.98 and F_F = 3.71. The critical value of F(2,86) for α = 0.05 is ~2.72, so we reject the null hypothesis.
• In this case, CD = 0.5.
Which regularization works better?
Method RaF(2.95) MPRaF-T(1.97) MPRaF-P(2.11) MPRaF-N(2.97)
RaF(2.95)
MPRaF-T(1.97) √ √
MPRaF-P(2.11) √ √
MPRaF-N(2.97)

Method RRoF(2.91) MPRRoF-T(2.19) MPRRoF-P(2.15) MPRRoF-N(2.75)
RRoF(2.91)
MPRRoF-T(2.19) √
MPRRoF-P(2.15) √
MPRRoF-N(2.75)

The numbers in brackets are the average rank of each algorithm. √ means the method of the row is significantly better than the method of the column; an empty entry means there is no significant difference between the method of the row and the method of the column.
Bias-variance point of view: Bias
[Figures: Bias-RaF, Bias-RRoF]
The smallest value of bias/variance gets rank 1; the largest value gets the highest rank. Groups of methods whose values are not significantly different (α = 0.05) are connected.
Discussion about the Bias
1. For RaF, MPRaF-T and MPRaF-P generate lower bias than the others, and MPRaF-P is slightly better, which demonstrates that MPSVM can better capture the geometric structure of the data.
2. For RRoF, MPRRoF-N is significantly worse than the others.
3. This further indicates that Rotation Forest tends to generate base classifiers with slightly lower bias than Random Forest.
Bias-variance point of view: Variance
[Figures: Var-RaF, Var-RRoF]
For all the ensembles, no significant difference is detected among the variances of the base classifiers.
Discussion about the Variance
1. For RF and its MPSVM-based variants, though there is no significant difference among their variances, the MPRaF variants tend to reduce the variance to a larger extent, especially MPRaF-T and MPRaF-P.
2. Exactly the same conclusion can be drawn for RRoF and its MPSVM-based variants.
For a given regularization approach, which ensemble method is better?
• We use the sign test to compare each pair of algorithms.
• If the two algorithms compared are, as assumed under the null hypothesis, equivalent, each should win on approximately N/2 out of N data sets; if the number of wins is at least N/2 + 1.96·√N/2, the algorithm is significantly better with p < 0.05.
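A quick check of this criterion (assuming the same N = 44 data sets as in the Friedman comparison, which is an assumption of the sketch):

```python
import math

def sign_test_wins_needed(N):
    """Wins needed for p < 0.05 under the slide's normal approximation."""
    return N / 2 + 1.96 * math.sqrt(N) / 2

# with N = 44 the threshold is about 28.5, i.e. 29 wins; this is consistent
# with only the (12, 32) pair on the next slide being marked significant
print(sign_test_wins_needed(44))
```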
Method Significance
(MPRaF-T,MPRRoF-T) (18,26)
(MPRaF-P,MPRRoF-P) (16,28)
(MPRaF-N,MPRRoF-N) (12,32) √
For a given regularization approach, which
ensemble method is better?
The first number in the bracket represents the number of
times RaF wins, the second number means the number of
times RRoF wins. √ means there is significant difference
between this pair of algorithms.
On the effect of m
• The parameter m denotes the number of features randomly selected at each node.
• The smaller m is, the stronger the randomization of the trees and the weaker the dependence of their structures on the output.
• However, if m is small, the features randomly selected at a node may fail to capture the geometry of the data samples.
The effect of m for the "parkinsons" dataset [figures: RaF, RRoF]
The effect of m for the "wine quality (Wine)" dataset [figures: RaF, RRoF]
Discussion about the parameter m
• For very small values of m, the accuracy of all the ensemble methods is very low, especially for the MPSVM-based ensembles.
• However, as m grows, the accuracies of all MPSVM-based ensembles grow significantly and become stable very quickly, except for MPSVM with NULL space regularization.
Future Work
• Ensemble methods with other base classifiers
• Ensembles with deep learning, big data, etc.
• Applications of ensemble methods
More Comprehensive Evaluation
• We choose MPRaF-P and compare it with the other 179 classifiers on 121 UCI datasets:
Fernández-Delgado, Manuel, et al. "Do we need hundreds of classifiers to solve real world classification problems?" The Journal of Machine Learning Research 15.1 (2014): 3133-3181.
The top 2 are our proposed methods with axis-parallel regularization (P).
Strengths & Weaknesses of RF
Easy to parallelize
Excellent batch mode performance
Online learning
Not friendly for transfer learning, visual feature extraction.
Neural Networks
• Theoretical proof about the approximation ability of standard multilayer feedforward network can be found in:
Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators." Neural networks 2.5 (1989): 359-366.
• Some conditions are:
Arbitrary squashing functions.
Sufficiently many hidden units are available.
• Standard multilayer feedforward network can approximate virtually any function to any desired degree of accuracy.
Weaknesses and Improvement
• Some weaknesses of back-prop:
– The error surface usually has multiple local minima.
– Slow convergence.
– Sensitivity to the learning-rate setting.
• Improvement:
– The parameters in the hidden layers can be randomly and appropriately generated without learning.
– The parameters in the last layer can then be computed by least squares.
References:
W. F. Schmidt, M. A. Kraaijveld, R. P. W. Duin, et al., "Feedforward neural networks with random weights," in Proc. 11th IAPR International Conference on Pattern Recognition, IEEE, 1992, pp. 1–4.
Y.-H. Pao, G.-H. Park, and D. J. Sobajic, "Learning and generalization characteristics of the random vector functional-link net," Neurocomputing 6.2 (1994), pp. 163–180.
Structure of RVFL
Parameters H (indicated in blue) in the enhancement nodes are randomly generated in a proper range and kept fixed.
The original features (from the input layer) are concatenated with the enhanced features (from the hidden layer) to boost performance.
Learning aims at solving y_i = d_i ∗ W, i = 1, 2, ..., n, where n is the number of training samples, W (indicated in red and gray) are the weights in the output layer, and y, d represent the target and the combined features, respectively.
RVFL details
• Notations:
• X = [x1, x2, ..., xn]': input data (n samples, m features).
• β = [β1, β2, ..., βm]': the weights for the enhancement nodes (m × k, where k is the number of enhancement nodes).
• b = [b1, b2, ..., bk]: the biases for the enhancement nodes.
• H = h(X ∗ β + ones(n, 1) ∗ b): the feature matrix after the enhancement nodes, where h is the activation function.
RVFL details
tribas – triangular basis function; relu – rectified linear unit.
Stacking the combined feature vectors d(x_i) row-wise gives the linear system

\begin{pmatrix} d(x_1) \\ d(x_2) \\ \vdots \\ d(x_n) \end{pmatrix} W = Y, \quad \text{i.e. } D W = Y
Learning in RVFL
• Once the random parameters β and b are generated in a proper range, learning in RVFL amounts to solving y_i = d_i ∗ W, i = 1, 2, ..., n, where n is the number of training samples, W are the weights in the output layer, and y, d represent the target and the combined features.
• Objective: min_W ||DW − Y||² + λ||W||²
• Solutions:
– In the primal space: W = (λI + D'D)⁻¹ D'Y
– In the dual space: W = D'(λI + DD')⁻¹ Y
where λ is the regularization parameter; as λ → 0, the method converges to the Moore–Penrose pseudoinverse solution. D and Y are the stacked versions of the features and targets.
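The whole training procedure fits in a few lines (a NumPy sketch under the notation above; tanh, the uniform (-1, 1) range, and the toy data are illustrative choices, not the authors' released code):

```python
import numpy as np

def rvfl_train(X, Y, k=100, lam=1e-3, scale=1.0, rng=None):
    """Random enhancement layer + direct links + ridge-regression output."""
    rng = rng or np.random.default_rng(0)
    n, m = X.shape
    beta = rng.uniform(-scale, scale, size=(m, k))  # random input weights (fixed)
    b = rng.uniform(-scale, scale, size=k)          # random hidden biases (fixed)
    D = np.hstack([X, np.tanh(X @ beta + b)])       # direct link: D = [X, H]
    # primal solution W = (lam*I + D'D)^(-1) D'Y; use the dual form when k+m > n
    W = np.linalg.solve(lam * np.eye(D.shape[1]) + D.T @ D, D.T @ Y)
    return beta, b, W

def rvfl_predict(X, beta, b, W):
    return np.hstack([X, np.tanh(X @ beta + b)]) @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = (X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=200)).reshape(-1, 1)
beta, b, W = rvfl_train(X, Y, rng=rng)
mse = np.mean((rvfl_predict(X, beta, b, W) - Y) ** 2)
print(mse < 1e-3)  # the direct links alone can fit this linear target
```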
Comprehensive Evaluation of RVFL
69
L. Zhang, P. N. Suganthan, "A Survey of Randomized Algorithms for
Training Neural Networks," Information Sciences, DoI: 10.1016/j.ins.2016.01.039,
Volumes 364–365, pp.146–155, Oct, 2016
Evaluation Protocol
• Follow the exact pipeline of the JMLR paper.
• The Friedman rank of each method is used to evaluate the classifiers.
L. Zhang, P. N. Suganthan, "A Comprehensive Evaluation of Random Vector Functional Link Networks," Information Sciences, DOI:10.1016/j.ins.2015.09.025, Volumes 367–368, pp. 1094–1105, Nov 2016. (Codes Available: 2016-RVFL-Comp-Eval-Classification)
Results: direct link and bias
The direct links lead to better performance than configurations without them in all cases. The bias term in the output neurons has only mixed effects on the performance: it may or may not improve it.
Results: Activation Function
The radbas function always leads to better performance; the hardlim and sign activation functions lead to the second-worst and the worst performances, respectively.
Results: Moore–Penrose Pseudoinverse vs Ridge Regression
The ridge regression based RVFL shows better performance than the Moore–Penrose pseudoinverse based RVFL.
Results: Scaling the randomization range of input weights and biases
Performance of RVFL for different randomization ranges. A smaller rank indicates better accuracy and fewer hidden neurons. N stands for the number of hidden neurons corresponding to the test accuracy used in the ranking; in other words, for each ranking, performance is maximized and the corresponding number of neurons recorded.
Results: Scaling the randomization range of input weights and biases
• Scaling down the randomization range of the input weights and biases to avoid saturating the neurons risks degrading the discrimination power of the random features. However, this can be compensated by having more hidden neurons or the direct link.
• Scaling up the randomization range to enhance the discrimination power of the random features risks saturating the neurons. Again, this can be compensated by having more hidden neurons or by combining with the direct link from the input to the output layer. However, for reasons explained in Section 2.5, we prefer lower model complexity, i.e. fewer hidden neurons.
Ridge Regression
Linear regression with ℓ2 regularization:

\min_w \; \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda \|w\|^2

Its solution: w = (X^T X + \lambda I)^{-1} X^T y.
Consider a feature map φ(·) that maps the original features to a higher dimension to enhance the discriminability, i.e. the kernel trick: replace x_i by φ(x_i) above.
Kernel Ridge Regression
According to the representer theorem, the solution can be expressed as a linear combination of the samples: w = \sum_i \alpha_i φ(x_i).
Objective with the kernel trick: \min_\alpha \|y - K\alpha\|^2 + \lambda\, \alpha^T K \alpha, where K_{ij} = φ(x_i)^T φ(x_j).
The solution is: \alpha = (K + \lambda I)^{-1} y.
Output: f(x) = \sum_i \alpha_i k(x_i, x).
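A compact sketch of the dual solution with an RBF kernel (the kernel choice, hyperparameters, and toy data are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma):
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

def krr_fit(X, y, lam, gamma):
    """alpha = (K + lam*I)^(-1) y, from the representer-theorem solution."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_test, gamma):
    """f(x) = sum_i alpha_i k(x_i, x)."""
    return rbf_kernel(X_test, X_train, gamma) @ alpha

X = np.linspace(0, 1, 50).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel()
alpha = krr_fit(X, y, lam=1e-4, gamma=50.0)
err = np.max(np.abs(krr_predict(X, alpha, X, gamma=50.0) - y))
print(err)  # small: the smooth target is fit almost exactly
```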
KRR for Classification
• Through the commonly used 0-1 coding of class labels.
C. Saunders, A. Gammerman and V. Vovk, "Ridge Regression Learning Algorithm in Dual Variables," in Proc. ICML 1998.
S. An, W. Liu, and S. Venkatesh, "Face recognition using kernel ridge regression," in Proc. IEEE CVPR 2007, IEEE, Piscataway, N.J., pp. 1-7, 2007.
Overall performance compared with the JMLR results
[Tables: top 10 classifiers and top 11–20 classifiers, with our proposed method highlighted.]
Reference: Fernández-Delgado, Manuel, et al. "Do we need hundreds of classifiers to
solve real world classification problems?." The Journal of Machine Learning
Research 15.1 (2014): 3133-3181.
References
• Zhang, Le, and Ponnuthurai Nagaratnam Suganthan. "Random forests with ensemble of feature spaces." Pattern Recognition 47.10 (2014): 3429-3437.
• Zhang, Le, and Ponnuthurai N. Suganthan. "Oblique decision tree ensemble via multisurface proximal support vector machine." Cybernetics, IEEE Transactions on 45.10 (2015): 2165-2176.
• Zhang, Le, Ren, Ye, and P. N. Suganthan. "Ensemble Classification and Regression-Recent Developments, Applications and Future Directions [Review Article]." Computational Intelligence Magazine, IEEE 11.1 (2016): 41-53.
• Zhang, Le, and P. N. Suganthan. "A comprehensive evaluation of random vector functional link networks." Information Sciences (2015).
• Zhang, Le, and P. N. Suganthan. "A Survey of Randomized Algorithms for Training Neural Networks." Information Sciences (2016).
Time Series Forecasting in Renewable Energy Systems
Outline
• Introduction
– Time Series
– Computational Intelligence
– Ensemble Methods
• Empirical Mode Decomposition (EMD)
– EMD-SVR
– Ensemble EMD
– Application: Wind speed forecasting
• Random Vector Functional Link (RVFL) Network
– RVFL structural configurations
– Application: Load demand forecasting
– Application: Wind power ramp classification
Introduction – Time Series
A time series is a sequence of data points such that:
1. It consists of successive measurements made over a time interval.
2. The time interval is continuous.
3. The spacing between any two consecutive data points is the same.
4. Each time unit in the interval has at most one data point.
5. It may be univariate or multivariate.
6. The task is classification or forecasting.
Introduction – Applications
[Images: finance and energy applications]
Time Series Forecasting
• Historical value => future value
• Current observations => future value
• Hybrid of the above two
• Statistical approach
• Physical approach based on differential eqns.
• Computational intelligence based approach
• Hybrid/ensemble approach
Y. Ren, L. Zhang, and P. N. Suganthan, “Ensemble Classification and Regression –
Recent Developments, Applications and Future Directions,” IEEE Comput. Intell.
Mag., 2016, doi: 10.1109/MCI.2015.2471235
Time Series Forecasting (Cont’d)
• Wind speed/power forecasting
• Electricity Load demand forecasting
• Electricity Price forecasting
• Solar irradiance/power forecasting
• Load/wind/solar power ramp forecasting, etc.
Outline
• Introduction
– Time Series
– Computational Intelligence
– Ensemble Methods
• Empirical Mode Decomposition (EMD)
– EMD-SVR
– Ensemble EMD
– Application: Wind speed forecasting
• Random Vector Functional Link (RVFL) Network
– RVFL structural configurations
– Application: Load demand forecasting
– Application: Wind power ramp classification
Empirical Mode Decomposition (EMD)
Finally, the series is expressed as the sum of the IMFs and a residue:

x(t) = \sum_{i=1}^{N} c_i(t) + r_N(t)

IMF: Intrinsic Mode Functions
N. Huang, Z. Shen, S. Long, M. Wu, H. Shih, Q. Zheng, N. Yen, C. Tung, and H. Liu, "The empirical mode decomposition and Hilbert spectrum for nonlinear and nonstationary time series analysis," Proc. Royal Society London A, vol. 454, pp. 903–995, 1998.
Empirical Mode Decomposition (EMD) (cont'd)
Empirical Mode Decomposition (EMD)
(cont’d)
• Adaptive, Local, Orthogonal, Completeness
• Decompose complex time series into simpler time series (narrow band, symmetric)
• Reveal hidden features/correlations of the time series
• Mode mixing problem:
– a sub-series consists of signal spanning a wide band of frequencies, or
– more than one sub-series contains signals in a similar frequency band
• Ensemble of EMD => solves the problem
G. Rilling, P. Flandrin, and P. Gonçalvès, "On empirical mode decomposition and its algorithms," in Proc. IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing (NSIP'03), Grado, Italy, 2003, pp. 8–11.
Ensemble of EMD
• Ensemble EMD (EEMD):
– Add uncorrelated Gaussian noise to the original time series.
– Repeat EMD on the noise-added series.
– Combine the results: the noise realizations cancel each other out.
– But completeness is violated.
Z. Wu and N. E. Huang, "Ensemble empirical mode decomposition: a noise-assisted data analysis method," Advances in Adaptive Data Analysis, vol. 1, no. 1, pp. 1–41, 2009.
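The EEMD loop itself is tiny; here is a hedged sketch in which `emd` is any user-supplied decomposition routine assumed to return a fixed number of sub-series (a real sifting-based EMD is outside the scope of this sketch, so a toy stand-in is used below):

```python
import numpy as np

def eemd(x, emd, n_trials=100, noise_std=0.2, rng=None):
    """Ensemble EMD: decompose many noise-added copies and average, so the
    added Gaussian noise cancels out in the mean."""
    rng = rng or np.random.default_rng(0)
    sigma = noise_std * np.std(x)
    imfs = [emd(x + rng.normal(0.0, sigma, size=len(x))) for _ in range(n_trials)]
    return np.mean(imfs, axis=0)

# toy stand-in for EMD: split into a detrended part and a moving-average trend
def toy_emd(x, w=9):
    trend = np.convolve(x, np.ones(w) / w, mode="same")
    return np.stack([x - trend, trend])

t = np.linspace(0, 1, 256)
x = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 2 * t)
subs = eemd(x, toy_emd, n_trials=200)
print(subs.shape)  # (2, 256): averaged sub-series
```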
Ensemble of EMD (cont'd)
• Complementary EEMD (CEEMD):
– Gaussian noise is added in complementary pairs.
– Completeness is retained.
– Needs more trials (ensembles).
J.-R. Yeh, J.-S. Shieh, and N. E. Huang, "Complementary ensemble empirical mode decomposition: a novel noise enhanced data analysis method," Advances in Adaptive Data Analysis, vol. 2, no. 2, pp. 135–156, 2010.
Ensemble of EMD (cont'd)
• Complete EEMD with Adaptive Noise (CEEMDAN):
– Adaptive noise.
– Sequential process.
– Reduces the number of trials (ensembles).
– But cannot be parallelized.
M. Torres, M. Colominas, G. Schlotthauer, and P. Flandrin, "A complete ensemble empirical mode decomposition with adaptive noise," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'11), Prague, Czech Republic, 22-27 May 2011.
Study on EMD-SVR/ANN for wind speed forecasting
• Forecasting models: EMD-ANN, EEMD-ANN, CEEMD-ANN, CEEMDAN-ANN; EMD-SVR, EEMD-SVR, CEEMD-SVR, and CEEMDAN-SVR.
• 12 wind speed time series datasets obtained from the National Data Buoy Center (NDBC).
• 70% for training and 30% for testing.
• Forecasting horizons: 1, 3 and 5 hours ahead.
• Data scaled to the (0, 1] interval.
• Compared on RMSE/MASE.
• CEEMDAN: complete ensemble empirical mode decomposition with adaptive noise.
Y. Ren, P. N. Suganthan, and N. Srikanth, "A comparative study of empirical mode decomposition-based short-term wind speed forecasting methods," IEEE Trans. Sustain. Energy, vol. 6, no. 1, pp. 236--244, Jan. 2015.
Study on EMD-SVR/ANN for wind speed forecasting (cont'd)
[Tables: RMSE; Friedman test of EMD-based hybrid methods for 1, 3 and 5 hour ahead wind speed forecasting; CPU time (s) of EMD, EEMD, CEEMD and CEEMDAN on the wind speed TS.]
Concluding remarks
• EMD based hybrid SVR methods outperformed the persistence method.
• EMD based hybrid ANN methods outperformed the persistence method for 1 and 3 hour ahead forecasting only.
• EMD based hybrid SVR methods outperformed the SVR for 1, 3 and 5 hour ahead forecasting.
• EMD-ANN method had significantly worse performance than the ANN.
• EEMD/CEEMD/CEEMDAN-ANN methods had comparable performance as the ANN.
• CEEMDAN-SVR and EEMD-SVR outperformed the CEEMD-SVR and EMD-SVR for 1, 3 and 5 hour ahead forecasting.
• CEEMDAN-ANN, the EEMD-ANN and the CEEMD-ANN outperformed the EMD-ANN.
• In general, the EMD-based hybrid SVR methods performed better than the EMD-based hybrid ANN methods, although SVR and ANN alone had similar performance => EMD and its improved versions enhance the performance of SVR on time series forecasting.
• Considering the CPU time and the number of decomposed sub-series, CEEMDAN-SVR is the best-performing method for wind speed time series forecasting.
Y. Ren, P. N. Suganthan, and N. Srikanth, “A comparative study of empirical mode
decomposition-based short-term wind speed forecasting methods,” IEEE Trans.
Sustain. Energy, vol. 6, no. 1, pp. 236--244, Jan. 2015.
Outline
• Introduction
  – Time Series
  – Computational Intelligence
  – Ensemble Methods
• Empirical Mode Decomposition (EMD)
  – EMD-SVR
  – Ensemble EMD
  – Application: Wind speed forecasting
• Random Vector Functional Link (RVFL) Network
  – RVFL structural configurations
  – Application: Load demand forecasting
  – Application: Wind power ramp classification
Random vector functional link (RVFL)
Network
Y.-H. Pao, G.-H. Park, and D. J. Sobajic, “Learning and generalization characteristics of the random vector functional-link net,” Neurocomputing, vol. 6, no. 2, pp. 163–180, 1994.
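The RVFL idea can be sketched in a few lines of numpy: hidden (enhancement) weights and biases are random and fixed, the direct input-output links are concatenated with the hidden outputs, and only the output weights are solved in closed form. The tanh activation, ridge regularizer, and function names here are illustrative assumptions, not the exact configuration used on these slides:

```python
import numpy as np

def rvfl_train(X, Y, n_hidden=30, lam=1e-4, seed=0):
    """Fit an RVFL net: random fixed hidden weights/biases,
    closed-form (ridge) solution for the output weights only."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, (X.shape[1], n_hidden))  # random input->hidden weights
    b = rng.uniform(-1, 1, n_hidden)                # random hidden-layer bias
    H = np.tanh(X @ W + b)                          # enhancement nodes
    D = np.hstack([X, H])                           # direct links + enhancement nodes
    # ridge-regularized least squares for the output weights
    beta = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ Y)
    return W, b, beta

def rvfl_predict(X, W, b, beta):
    return np.hstack([X, np.tanh(X @ W + b)]) @ beta
```

Because training reduces to one linear solve, RVFL is non-iterative, which is the source of the speed advantage reported later in these slides.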
RVFL variations
• Effects of:
  – Input layer bias
  – Hidden layer bias
  – Direct input–output connections
Y. Ren, P. N. Suganthan, N. Srikanth and G. Amaratunga, “Single Hidden Layer Neural Networks with Random Weights for
Short-term Electricity Load Demand Forecasting,” Information Sciences, 2016.
W. F. Schmidt, M. A. Kraaijveld, and R. P. W. Duin, “Feedforward neural networks with random weights,” in Proc. 11th IAPR International Conference on Pattern Recognition, Vol. II (Conference B: Pattern Recognition Methodology and Systems), IEEE, 1992, pp. 1–4.
RVFL variations (cont’d)
Diagram: input layer bias, hidden layer bias, direct input–output link
RVFL for load demand forecasting
• Input layer bias and hidden layer bias did not significantly affect the performance, whereas the direct input–output connections significantly improved it.
• The quantile scaling algorithm improved the performance for 1–4 hour and 18–24 hour ahead forecasting horizons.
• Feature selection based on the partial autocorrelation function or seasonal autoregression consistently degraded the performance on the seasonal time series.
Random Forest for TS forecasting
• Same concept as classification, but:
  – Classification tree => regression tree
  – Majority vote => averaging or selecting the median for the final output
• Important parameters:
  – n_tree: number of bootstrap samples (trees)
  – m_try: number of variables tried at each split
• Criteria at a split:
  – Minimize the residual sum of squares
  – Or minimize the within-group variance
  – Or maximize the between-group variance
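The split criterion above can be illustrated with a brute-force search for the threshold on one variable that minimizes the residual sum of squares (equivalently, the size-weighted within-group variance); a sketch, not an optimized implementation:

```python
import numpy as np

def best_split(x, y):
    """Return the threshold on x minimizing the residual sum of
    squares (RSS) of the two resulting groups of y."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    best_t, best_rss = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:            # cannot split between equal values
            continue
        left, right = y[:i], y[i:]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_t, best_rss = (x[i] + x[i - 1]) / 2, rss
    return best_t, best_rss
```

In a random forest, this search is repeated over m_try randomly chosen variables at every node of each of the n_tree trees grown on bootstrap samples.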
RVFL for load demand forecasting (cont’d)
Tables: performance, training time and testing time
RVFL for load demand forecasting (cont’d)
• No clear overall advantage of input layer and hidden layer biases. However, input layer biases are necessary for the network to function as a universal approximator => we recommend retaining the biases as configuration choices, as they may be beneficial for some forecasting problems.
• Compared with reported non-ensemble forecasting methods such as the persistence method, seasonal ARIMA and artificial neural networks, the RVFL network has significantly better performance.
• The RVFL network is, however, outperformed by the random forest, which is an ensemble method.
• The computation time (training plus testing) of the RVFL network is the shortest among the compared methods.
Wind Power Ramp Forecasting with
RVFL
Two ramp definitions: based on (i) local extrema and (ii) end points
Y. Ren, X. Qiu, P. N. Suganthan, N. Srikanth, G. Amaratunga, “Detecting Wind Power
Ramp with Random Vector Functional Link (RVFL) Network,” IEEE Symposium on
Computational Intelligence and Ensemble Learning (IEEE CIEL'15), Dec. 2015.
Wind Power Ramp Forecasting with
RVFL (cont’d)
• Noise caused by:
  – Wind gusts
  – Wind turbine maintenance / shutdown
  – Sensor noise or faults
• De-noising:
  – Outlier smoothing: median absolute deviation (MAD) rejection
  – Missing data: extrapolation
  – White noise: empirical mode decomposition (EMD)
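The MAD rejection step above can be sketched as follows; the 3-MAD threshold and the Gaussian-consistency constant are common conventions assumed here, not values specified on the slides:

```python
import numpy as np

def mad_reject(series, k=3.0):
    """Flag points more than k scaled MADs from the median as outliers.
    0.6745 rescales the MAD to be consistent with a Gaussian std."""
    s = np.asarray(series, dtype=float)
    med = np.median(s)
    mad = np.median(np.abs(s - med))
    z = 0.6745 * np.abs(s - med) / mad
    return z > k  # True where the point is rejected as an outlier
```

Flagged points would then be replaced by interpolated values before the EMD step.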
Wind Power Ramp Forecasting with RVFL (cont’d)
• Imbalanced data:
  – Wind ramp forecasting => classification problem
  – Ramp: minority class; no-ramp: majority class => imbalanced dataset
  – Over-sampling or under-sampling to correct the imbalance
• Error measure:
Experiment setup
1. Wind power time series
2. De-noise
3. Convert to power ramps (+1: significant ramp, -1: no significant ramp)
4. Over-sample to avoid imbalance
5. RVFL classification
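The over-sampling step in the setup above can be sketched as simple random duplication of the minority (ramp) class; random duplication with replacement is one common scheme, assumed here since the slides do not specify one:

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class samples (with replacement)
    until both classes are equally represented."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])
```

Balancing the classes this way keeps the classifier from trivially predicting "no ramp" everywhere.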
Results and Discussion
A fraction of the wind power generated in an ELIA wind farm; red segments denote power ramps
Summary of the Datasets for Wind Ramp Classification
Results and Discussion (cont’d)
Friedman rank sum test on the performance measures of the four classification methods
Nemenyi post-hoc test on the performance measures of the four classification methods with 12 hour window forecasting
Concluding remarks
• Wind power ramp forecasting with the RVFL network
• Two wind power ramp definitions: based on (i) local extrema and (ii) end points
• The RVFL network has comparable performance to ANN, RF and SVR for 6 hour ahead forecasting
• The RVFL network has better performance than the SVM for 12 hour ahead forecasting
• The RVFL network has a significant time advantage over SVM and ANN, and is comparable to RF
Conclusions
• Empirical mode decomposition based SVR improves time series forecasting accuracy
• Ensembles of EMD: CEEMDAN-SVR has the best performance and a short computation time
• The RVFL network with input bias, without hidden bias, and with direct input–output connections is the best configuration
• The RVFL network has significantly better performance than non-ensemble methods for wind speed TS forecasting
• The RVFL network has better performance and a shorter computation time for wind power ramp forecasting
• RF is also highly competitive.
Future work
• Ensemble of EMD with multi-variate time series
• Ensemble of EMD with complex valued time series
• Ensemble of RVFL network
• Investigate in detail the feature space after the enhancement node transformation
• Deep Learning ensembles
• Transfer Learning
Contributed by: Zhang Le (PhD 2016)
Ren Ye (PhD 2016)
Thank You !