Ensemble Based Classification and Forecasting...
Ensemble Based Classification and
Forecasting Methods
Dr P. N. Suganthan, School of EEE, NTU, Singapore
Some Software Resources Available from:
http://www.ntu.edu.sg/home/epnsugan
TENCON 2016
MBS Singapore, 22nd Nov. 2016
General Concept
[Diagram: the training data S is resampled into multiple data sets S1, S2, …, Sn; each data set trains a predictor P1, P2, …, Pn; the predictors' outputs are combined into a single decision H.]
Y. Ren, L. Zhang, and P. N. Suganthan, "Ensemble Classification and Regression – Recent Developments, Applications and Future Directions," IEEE Computational Intelligence Magazine, DOI: 10.1109/MCI.2015.2471235, Feb 2016.
Outline
• Classification
– Random forests
– Random Vector Functional Link
– Kernel Ridge Regression
– Benchmarking
• Forecasting
– Random vector functional link networks
– Random forests
Non-iterative methods
Bagging (1)
Sampling from the original training data with replacement (the original data set is not depleted), usually drawing the same number of samples for each bag as in the original dataset. In expectation a bag contains about 63% distinct samples and 37% duplicates, since the fraction of distinct samples is 1 − (1 − 1/n)^n ≈ 1 − 1/e ≈ 0.632.
[Diagram: the original training data is resampled into several bags, each with the same number of samples as the original dataset.]
Bagging (2)
1. For i = 1 … n_trees (the number of trees, i.e. bags):
   a) Draw samples (with replacement) from the training set to generate the training data Si.
   b) Learn a classifier Ci on Si.
2. For each test example:
   a) Apply all classifiers Ci.
   b) Predict the class that receives the highest number of votes.
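The two steps above can be sketched in a few lines of Python (an illustrative sketch, not code from the talk; `clf` stands for any trained base classifier treated as a callable):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Step 1a: draw n samples with replacement from a dataset of size n."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def bagging_predict(classifiers, x):
    """Step 2: majority vote over the ensemble's predictions for one example."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# The distinct-sample fraction quoted on the previous slide: in expectation a
# bag contains 1 - (1 - 1/n)^n -> 1 - 1/e ~ 63.2% distinct samples.
rng = random.Random(0)
data = list(range(10_000))
bag = bootstrap_sample(data, rng)
print(len(set(bag)) / len(data))  # close to 0.632
```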
Random Forests
L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
Patented, with restrictions on commercial usage.
Decision Tree: An Example
Classifying an email as either spam or a valid email using features such as ch!, ch$, 1999, george, free, our, receive, …
Each node is labeled based on the majority of the training samples reaching the node; a node that is not split further is a terminal node. The training samples of each class are indicated as x/y below each node.
A node of a decision tree
Assume that n training samples (each with M features) reach a node. They form an n × M matrix: each row is a data point and each column is a feature. At the node, d of the M features are selected randomly, and a threshold must be chosen for each selected feature. How to optimize the threshold for each feature?
Optimize the threshold for each feature
• Using Gini impurity, the threshold should be optimized so that the decrease Gini_{beforesplit} - Gini_{aftersplit} is maximized:

Gini_{beforesplit}(t) = 1 - \sum_{i=1}^{C} \left( \frac{n_{w_i}}{n} \right)^2

Gini_{aftersplit}(t) = \frac{n_l}{n} \left[ 1 - \sum_{i=1}^{C} \left( \frac{n^l_{w_i}}{n_l} \right)^2 \right] + \frac{n_r}{n} \left[ 1 - \sum_{i=1}^{C} \left( \frac{n^r_{w_i}}{n_r} \right)^2 \right]

C – total number of classes at the node; n – total samples; l – left branch; r – right branch; n_{w_i} – samples of class w_i.
There are other impurity criteria such as information gain, etc.
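The impurity computation and the threshold scan can be sketched directly (illustrative Python, not from the talk):

```python
def gini(counts):
    """Gini impurity 1 - sum_i (n_wi / n)^2 from per-class counts at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_decrease(left, right):
    """Impurity decrease Gini_beforesplit - Gini_aftersplit for one split."""
    n_l, n_r = sum(left), sum(right)
    n = n_l + n_r
    before = gini([l + r for l, r in zip(left, right)])
    after = (n_l / n) * gini(left) + (n_r / n) * gini(right)
    return before - after

def best_threshold(values, labels, classes=(0, 1)):
    """Scan candidate thresholds for one feature, keeping the best decrease."""
    best_t, best_d = None, -1.0
    for t in sorted(set(values)):
        left = [sum(v <= t and y == c for v, y in zip(values, labels)) for c in classes]
        right = [sum(v > t and y == c for v, y in zip(values, labels)) for c in classes]
        if sum(left) and sum(right):
            d = gini_decrease(left, right)
            if d > best_d:
                best_t, best_d = t, d
    return best_t, best_d

print(best_threshold([1, 2, 3, 4], [0, 0, 1, 1]))  # (2, 0.5): a pure split
```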
Rotation Forest
The training data form an N × M matrix, where M is the dimensionality of each data point.
• A distinct rotation matrix for each tree
• All features and all samples are rotated at the root level.
• In the default version, all rotated features are used in each node.
L. Zhang, P. N. Suganthan, "Random Forests with Ensemble of Feature Spaces," Pattern Recognition, 47 (10), 3429-3437, 2014. (Codes Available: 2015-TCyb-Oblique-RF)
Rotation Forest Illustrated
[Figure]
Rotation Forest (2)
• Base classifiers: decision trees
• PCA is a simple rotation of the coordinate axes
Method for Constructing the Rotation Matrix
X: the objects in the training data set; x = [x_1, x_2, …, x_M]^T is a data point with M features.

X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1M} \\ \vdots & & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NM} \end{pmatrix} \quad (\text{an } N \times M \text{ matrix})

Y = [y_1, y_2, …, y_N]^T: class labels with c classes.
Method for Constructing the Rotation Matrix
• For i = 1 … L (to construct the training set for classifier D_i):
– Split the full feature set F into K random subsets F_{i,j}, j = 1 … K, each with m = M/K features.
• For j = 1 … K (illustrated for i = 1):
– X_{1,j}: the data set X restricted to the features in F_{1,j}.
– Eliminate a random subset of the data.
– Select a bootstrap sample from X_{1,j} to obtain X'_{1,j}.
– Run PCA on X'_{1,j} using only its m features (default m = 3).
– This yields the principal components a^{(1)}_{1,j}, …, a^{(m_j)}_{1,j}.
Method … (Cont'd)
• Arrange the principal components for all j to obtain the block-diagonal rotation matrix

R_1 = \begin{pmatrix} a^{(1)}_{1,1}, \ldots, a^{(M_1)}_{1,1} & [0] & \cdots & [0] \\ [0] & a^{(1)}_{1,2}, \ldots, a^{(M_2)}_{1,2} & \cdots & [0] \\ \vdots & & \ddots & \vdots \\ [0] & [0] & \cdots & a^{(1)}_{1,K}, \ldots, a^{(M_K)}_{1,K} \end{pmatrix}

• Rearrange the rows of R_1 so as to match the order of features in F, obtaining R_1^a.
• Build classifier D_1 using X R_1^a as its training set.
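The per-tree construction just described can be sketched as follows (a simplified illustration: the elimination of a random data subset is reduced here to a plain bootstrap of rows, features are split evenly, and none of this is the authors' released code):

```python
import numpy as np

def rotation_matrix(X, K, rng):
    """Build one tree's rotation matrix: PCA per random feature subset,
    loadings placed block-wise, rows aligned with the original feature order."""
    N, M = X.shape
    subsets = np.array_split(rng.permutation(M), K)
    R = np.zeros((M, M))
    for feats in subsets:
        rows = rng.choice(N, size=int(0.75 * N), replace=True)  # bootstrap sample
        Xs = X[np.ix_(rows, feats)]
        Xs = Xs - Xs.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xs, full_matrices=False)  # principal axes
        # indexing by `feats` both places the block and rearranges the rows
        R[np.ix_(feats, feats)] = Vt.T
    return R

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
R = rotation_matrix(X, K=2, rng=rng)
X_rot = X @ R  # training set for this tree
print(R.shape)  # (6, 6)
```

Since every block is orthonormal, the assembled R is itself orthogonal, i.e. a pure rotation of the feature space.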
Hyperplane in a node of a Decision Tree
• In most cases, an axis-parallel hyperplane is employed in the decision tree.
• An oblique hyperplane can be better.
• Axis-parallel = univariate = orthogonal
• Oblique = multivariate
Oblique vs axis-parallel hyperplane
[Figure: an example instance space in the (x, y) plane, where "+" and "." mark the two classes. The staircase formed by the univariate tests X>2, Y>4, X>5, Y>2 (with yes/no branches) is the decision boundary of the univariate DT, while a single straight line is the decision boundary of the oblique DT.]
A Toy Example of Oblique and Axis-Parallel RF
Oblique Decision Tree (DT) Ensemble via Multisurface Proximal SVM (MP-SVM)
• Univariate (axis-parallel) DT: finding a split amounts to finding the feature that is most 'useful' in discriminating the input data.
• Multivariate (oblique) DT: finding a 'composite' feature, a combination of the existing features, that has good discriminatory power.
• Suppose there are n examples with d features. For a univariate DT there exist only d(n-1) candidate hyperplanes; the number of oblique hyperplanes, however, is O(n^d).
• Exhaustive search, which works quite well for univariate DTs, is impossible for multivariate DTs.
Le Zhang and P. N. Suganthan, "Oblique Decision Tree Ensemble via Multisurface Proximal Support Vector Machine," IEEE Transactions on Cybernetics, Oct 2015. 10.1109/TCYB.2014.2366468 (Codes available online)
How to find the optimal hyperplane in each internal node
• Heuristic search: hill climbing, simulated annealing, genetic algorithms.
• Time-consuming, and usually leads to sub-optimal solutions.
• For pattern classification, evolutionary methods are not effective.
SVM and its Variant: Multisurface Proximal SVM (MP-SVM)
SVM finds two parallel hyperplanes that divide the feature space into three disjoint parts. The data lying between these two hyperplanes are linearly inseparable; SVM classifies data by assigning them to one of the remaining two disjoint half-spaces.
MP-SVM instead aims to find two clustering hyperplanes: the first plane is closest to the class 1 data and farthest from class 2, and the second plane is closest to class 2 and farthest from class 1.
Obtain the optimal hyperplane from MP-SVM
• An example of how to obtain the optimal hyperplane from the two clustering hyperplanes. The two red planes are the clustering planes, and the two blue planes are their two angle bisectors. One bisector is chosen to act as the test: the data above the plane go to one child node and the rest go to the other child node. Since there are two bisectors, we choose the one with the better discriminant ability, which can be measured by a criterion such as information gain or Gini impurity.
MPSVM
• The first and the second clustering hyperplanes (with parameters W, b) are obtained by solving the two problems below, where A is the matrix of the first class (each row a data sample), B is the matrix of the second class, and e is a vector of ones with the same dimension as AW and BW:

\min_{(W,b) \neq 0} \frac{\|AW - e b\|^2}{\|BW - e b\|^2}, \qquad \min_{(W,b) \neq 0} \frac{\|BW - e b\|^2}{\|AW - e b\|^2}
MPSVM
• By defining G = [A  -e]^T [A  -e], H = [B  -e]^T [B  -e] and Z = [W; b], the two problems become:

\min_{Z \neq 0} \frac{Z^T G Z}{Z^T H Z}, \qquad \min_{Z \neq 0} \frac{Z^T H Z}{Z^T G Z}

• Thus, the two clustering hyperplanes are the eigenvectors corresponding to the smallest and the largest eigenvalues of the generalized eigenvalue problem:

G Z = \lambda H Z, \quad Z \neq 0
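A NumPy sketch of this generalized eigenvalue step (the small delta*I term keeps the solve well-posed and anticipates the regularization discussed later; both it and the toy data are assumptions of the sketch, not the paper's code):

```python
import numpy as np

def mpsvm_planes(A, B, delta=1e-6):
    """Clustering hyperplanes [W; b] from G z = lambda H z with
    G = [A -e]'[A -e] and H = [B -e]'[B -e]."""
    Ae = np.hstack([A, -np.ones((A.shape[0], 1))])
    Be = np.hstack([B, -np.ones((B.shape[0], 1))])
    G = Ae.T @ Ae + delta * np.eye(Ae.shape[1])
    H = Be.T @ Be + delta * np.eye(Be.shape[1])
    # G z = lambda H z  <=>  (H^-1 G) z = lambda z
    vals, vecs = np.linalg.eig(np.linalg.solve(H, G))
    order = np.argsort(vals.real)
    # smallest eigenvalue: plane close to class A, far from B; largest: reverse
    return vecs[:, order[0]].real, vecs[:, order[-1]].real

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2)) + np.array([3.0, 0.0])
B = rng.normal(size=(50, 2)) + np.array([-3.0, 0.0])
z_a, z_b = mpsvm_planes(A, B)
```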
MPSVM based decision tree
• How to handle the multiclass problem, since MPSVM can only solve binary classification problems?
• As the tree grows, the number of samples reaching the lower nodes decreases, but the matrices G and H always have size (M + 1) × (M + 1), so G and H may become singular.
• (One approach is to use a feature subspace, which both mitigates the singularity problem and reduces computation time.)
MP-SVM Based Decision Tree
The distance between two classes w_1 and w_2 is measured by the Bhattacharyya distance between their Gaussian models:

B(w_1, w_2) = \frac{1}{8} (\mu_2 - \mu_1)^T \left[ \frac{\Sigma_1 + \Sigma_2}{2} \right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln \frac{\left| (\Sigma_1 + \Sigma_2)/2 \right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}}
Multiclass problem
With C classes, compute the C × C distance matrix

D = \begin{pmatrix} 0 & d_{1,2} & \cdots & d_{1,C} \\ d_{2,1} & 0 & \cdots & d_{2,C} \\ \vdots & & \ddots & \vdots \\ d_{C,1} & d_{C,2} & \cdots & 0 \end{pmatrix}

The pair of classes with the largest distance seeds the two groups W_p and W_n. Each remaining class is then assigned to W_p if Distance 1 (to W_p) is smaller than Distance 2 (to W_n), and to W_n otherwise.
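Assuming the pairwise distance d_{i,j} is the Bhattacharyya distance between Gaussian class models, it can be computed as (illustrative sketch, not the paper's code):

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussian class models."""
    cov = (cov1 + cov2) / 2
    diff = mu2 - mu1
    mean_term = diff @ np.linalg.solve(cov, diff) / 8
    det_term = 0.5 * np.log(np.linalg.det(cov)
                            / np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return mean_term + det_term

# identical classes have distance 0; separating the means increases it
mu, cov = np.zeros(2), np.eye(2)
print(bhattacharyya(mu, cov, mu, cov))        # 0.0
print(bhattacharyya(mu, cov, mu + 2.0, cov))  # 1.0
```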
NULL space method
Solving the small sample size (S3) problem with the null space approach.

The objective is \min_{Z \neq 0} \frac{Z^T G Z}{Z^T H Z}. Suppose G has rank r. If G is singular, rewrite the problem as

\max_{Z \neq 0} \frac{Z^T H Z}{Z^T G Z}

Let Q = [x_{r+1}, \ldots, x_n] collect the eigenvectors spanning the null space of G, and project: H' = Q Q^T H Q Q^T, G' = Q Q^T G Q Q^T = 0. The problem becomes

\max_{Z \neq 0,\; Z^T G' Z = 0} Z^T H' Z

whose solution is the eigenvector of H' corresponding to the largest eigenvalue.
Other methods to handle the S3 problem
• The NULL space method [1,2] is very sensitive to data perturbations.
• Here, we apply two regularization approaches: Tikhonov regularization and axis-parallel split regularization.
[1] X. Jiang, "Linear subspace learning-based dimensionality reduction," IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 16–26, 2011.
[2] X. Jiang et al., "Eigenfeature regularization and extraction in face recognition," IEEE TPAMI, vol. 30, no. 3, pp. 383–394, 2008.
Other methods to handle S3
• Tikhonov regularization works by adding a constant term to the diagonal entries of the matrix to be regularized. In our case, if G becomes rank deficient, G is regularized as G' = G + \delta I.
Other methods to handle S3
• If the matrix G or H becomes singular at a node, we can always use an axis-parallel split to continue the tree induction process. The decision tree then grows using heterogeneous test functions: from the root node down to the current node it uses MPSVM splits, and from the current node down to the leaves it switches to axis-parallel splits.
Computational Complexity
• Here we make no assumption about the tree structure.
• Suppose there are m data samples with n features.
• For an axis-parallel split: m·n·log n (n·log n for each feature, where the Gini computation takes log n).
• For the MPSVM-based decision tree, the complexity of the generalized eigenvalue problem is of order m³.
• In the upper levels of the tree (near the root), MPSVM is much faster than the axis-parallel split.
• In the lower levels (near the leaves), the computational complexity is small (less data).
• MPSVM-T has slightly larger computational complexity than the standard DT; MPSVM-P has the same computational complexity as the standard DT.
T – Tikhonov regularization & P – axis-parallel regularization
Experimental results
Datasets
Le Zhang and P. N. Suganthan, “Oblique Decision Tree Ensemble via Multisurface Proximal Support Vector Machine,” IEEE Transactions on Cybernetics, Oct 2015. 10.1109/TCYB.2014.2366468 (Codes available online)
Experimental Results (1)
• Experimental settings:
1. Rotation Forest and Random Forest were kept at their default values in WEKA.
2. For Rotation Forest, M is fixed to 3.
3. All ensemble methods have the same ensemble size, 50.
4. Base DT classifier: CART (Breiman), Classification and Regression Tree.
5. Databases: UCI Machine Learning Repository, bioinformatics datasets, face recognition datasets.
6. 10 runs of 3-fold cross-validation.
Experimental Result: RaF
MPRaF-P: Multisurface Proximal Random Forest with axis-parallel regularization.
Experimental Result: RoF
MPRoF-T: Multisurface Proximal Rotation Forest
with Tikhonov regularization
Discussion
• RaF
• RoF

\min_{Z \neq 0} \frac{Z^T G Z}{Z^T H Z}

For RaF, H is (m+1) × (m+1). For the default RoF, H is (M+1) × (M+1), with m ≈ √M.
Why does MPSVM work for RaF?
Random Rotation Forest
• To solve this problem, we propose to employ a random feature subspace in the base learner of Rotation Forest, naming the result Random Rotation Forest (RRoF). In each node, the test function is evaluated on a randomly selected feature subset instead of the whole feature set.
• In this case, Random Forest and Rotation Forest differ in the way they perturb the data: Random Forest uses bagging to create a data subset, whereas Rotation Forest employs a different rotation matrix for each tree.
Experiment Result: RRoF
MPRRoF-N: Multisurface Proximal Random Rotation
Forest with Null space regularization
RaF vs RoF vs RRoF
• To assess the statistical significance of the results, we carry out a Friedman test. It ranks the algorithms on each data set separately, with the best-performing algorithm getting the lowest rank. Let r_i^j be the rank of the j-th of k algorithms on the i-th of N data sets. The Friedman test compares the average ranks of the algorithms, R_j = \frac{1}{N} \sum_i r_i^j. Under the null hypothesis, which states that all the algorithms are equivalent and so their ranks R_j should be equal, the Friedman statistic

\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_j R_j^2 - \frac{k(k+1)^2}{4} \right]

is distributed according to \chi^2 with k - 1 degrees of freedom when N and k are big enough. Iman and Davenport showed that Friedman's \chi_F^2 is undesirably conservative and derived a better statistic,

F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2}

which is distributed according to the F-distribution with k - 1 and (k - 1)(N - 1) degrees of freedom [1].
[1] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," JMLR, vol. 7, pp. 1–30, 2006.
RaF vs RoF vs RRoF
• If the null hypothesis is rejected, the Nemenyi test (Nemenyi, 1963) can be used to check whether the performance of two among the k classifiers is significantly different. If the corresponding average ranks differ by at least the critical difference

CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}

we say there is a significant difference between the two classifiers.
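The two statistics and the critical difference can be checked numerically (a sketch; N = 44 is inferred from the slide's F(2,86) degrees of freedom, and q_alpha = 2.343 is the Studentized-range constant for k = 3, alpha = 0.05 from Demšar's tables):

```python
import math

def friedman_stats(avg_ranks, N):
    """Friedman chi^2 and the Iman-Davenport F statistic from the average
    ranks of k algorithms over N data sets."""
    k = len(avg_ranks)
    chi2 = 12 * N / (k * (k + 1)) * (sum(r * r for r in avg_ranks)
                                     - k * (k + 1) ** 2 / 4)
    ff = (N - 1) * chi2 / (N * (k - 1) - chi2)
    return chi2, ff

def nemenyi_cd(k, N, q_alpha):
    """Nemenyi critical difference between average ranks."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * N))

# reproducing the numbers quoted on the next slide: ranks 2.24, 2.07, 1.69
chi2, ff = friedman_stats([2.24, 2.07, 1.69], N=44)
print(round(chi2, 2), round(ff, 2))        # 6.98 3.7
print(round(nemenyi_cd(3, 44, 2.343), 2))  # 0.5
```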
RaF vs RoF vs RRoF
• The average ranks for RaF, RoF and RRoF are 2.24, 2.07 and 1.69, respectively. Then χ²_F = 6.98 and F_F = 3.71. The critical value of F(2,86) for α = 0.05 is ~2.72, so we reject the null hypothesis.
• In this case, CD = 0.5.
Which regularization works better?
Method RaF(2.95) MPRaF-T(1.97) MPRaF-P(2.11) MPRaF-N(2.97)
RaF(2.95)
MPRaF-T(1.97) √ √
MPRaF-P(2.11) √ √
MPRaF-N(2.97)

Method RRoF(2.91) MPRRoF-T(2.19) MPRRoF-P(2.15) MPRRoF-N(2.75)
RRoF(2.91)
MPRRoF-T(2.19) √
MPRRoF-P(2.15) √
MPRRoF-N(2.75)

The numbers in brackets are the average rank of each algorithm. √ means the method of the row is significantly better than the method of the column; an empty entry means there is no significant difference between the method of the row and the method of the column.
Bias-variance point of view: Bias
[Figures: Bias-RaF, Bias-RRoF]
The smallest value of bias/variance gets rank 1; the largest value gets the highest rank. Groups of methods whose values are not significantly different (α = 0.05) are connected.
Discussion about the Bias
1. For RaF, MPRaF-T and MPRaF-P generate lower bias than the others, and MPRaF-P is slightly better, which demonstrates that MPSVM can better capture the geometric structure of the data.
2. For RRoF, MPRRoF-N is significantly worse than the others.
3. This further indicates that Rotation Forest tends to generate base classifiers with slightly lower bias than Random Forest.
Bias-variance point of view: Variance
[Figures: Var-RaF, Var-RRoF]
For all the ensembles, no significant difference is detected among the variances of the base classifiers.
Discussion about the Variance
1. For RF and its MPSVM-based variants, though there is no significant difference among their variances, the MPRaF variants tend to reduce the variance to a larger extent, especially MPRaF-T and MPRaF-P.
2. Exactly the same conclusion can be drawn for RRoF and its MPSVM-based variants.
For a given regularization approach, which ensemble method is better?
• We use the sign test to compare each pair of algorithms.
• If the two algorithms compared are, as assumed under the null hypothesis, equivalent, each should win on approximately N/2 out of N data sets; if the number of wins is at least N/2 + 1.96·√N/2, the algorithm is significantly better with p < 0.05.
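A quick check of this criterion (assuming the same N = 44 data sets as in the Friedman comparison, which is an assumption of the sketch):

```python
import math

def sign_test_wins_needed(N):
    """Wins needed for p < 0.05 under the slide's normal approximation."""
    return N / 2 + 1.96 * math.sqrt(N) / 2

# with N = 44 the threshold is about 28.5, i.e. 29 wins; this is consistent
# with only the (12, 32) pair on the next slide being marked significant
print(sign_test_wins_needed(44))
```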
Method Significance
(MPRaF-T,MPRRoF-T) (18,26)
(MPRaF-P,MPRRoF-P) (16,28)
(MPRaF-N,MPRRoF-N) (12,32) √
For a given regularization approach, which
ensemble method is better?
The first number in the bracket represents the number of
times RaF wins, the second number means the number of
times RRoF wins. √ means there is significant difference
between this pair of algorithms.
On the effect of m
• The parameter m denotes the number of features randomly selected at each node.
• The smaller m is, the stronger the randomization of the trees and the weaker the dependence of their structures on the output.
• However, if m is small, the features randomly selected at a node may fail to capture the geometry of the data samples.
The effect of m for the "parkinsons" dataset [figures: RaF, RRoF]
The effect of m for the "wine quality (Wine)" dataset [figures: RaF, RRoF]
Discussion about the parameter m
• For very small values of m, the accuracy of all the ensemble methods is very low, especially for the MPSVM-based ensembles.
• However, as m grows, the accuracies of all MPSVM-based ensembles grow significantly and become stable very quickly, except for MPSVM with NULL space regularization.
Future Work
• Ensemble methods with other base classifiers
• Ensembles with deep learning, big data, etc.
• Applications of ensemble methods
More Comprehensive Evaluation
• We choose MPRaF-P and compare it with the other 179 classifiers on 121 UCI datasets:
Fernández-Delgado, Manuel, et al. "Do we need hundreds of classifiers to solve real world classification problems?" The Journal of Machine Learning Research 15.1 (2014): 3133-3181.
The top 2 are our proposed methods with axis-parallel regularization (P).
Strengths & Weaknesses of RF
Easy to parallelize
Excellent batch mode performance
Online learning
Not friendly for transfer learning, visual feature extraction.
Neural Networks
• Theoretical proof about the approximation ability of standard multilayer feedforward network can be found in:
Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators." Neural networks 2.5 (1989): 359-366.
• Some conditions are:
Arbitrary squashing functions.
Sufficiently many hidden units are available.
• Standard multilayer feedforward network can approximate virtually any function to any desired degree of accuracy.
Weaknesses and Improvement
• Some weaknesses of back-prop:
– The error surface usually has multiple local minima.
– Slow convergence.
– Sensitivity to the learning-rate setting.
• Improvement:
– The parameters in the hidden layers can be randomly and appropriately generated without learning.
– The parameters in the last layer can then be computed by least squares.
References:
W. F. Schmidt, M. A. Kraaijveld, R. P. W. Duin, et al., "Feedforward neural networks with random weights," in Proc. 11th IAPR International Conference on Pattern Recognition, IEEE, 1992, pp. 1–4.
Y.-H. Pao, G.-H. Park, and D. J. Sobajic, "Learning and generalization characteristics of the random vector functional-link net," Neurocomputing 6.2 (1994), pp. 163–180.
Structure of RVFL
Parameters H (indicated in blue) in the enhancement nodes are randomly generated in a proper range and kept fixed.
The original features (from the input layer) are concatenated with the enhanced features (from the hidden layer) to boost performance.
Learning aims at solving y_i = d_i ∗ W, i = 1, 2, ..., n, where n is the number of training samples, W (indicated in red and gray) are the weights in the output layer, and y, d represent the target and the combined features, respectively.
RVFL details
• Notations:
• X = [x1, x2, ..., xn]': input data (n samples, m features).
• β = [β1, β2, ..., βm]': the weights for the enhancement nodes (m × k, where k is the number of enhancement nodes).
• b = [b1, b2, ..., bk]: the biases for the enhancement nodes.
• H = h(X ∗ β + ones(n, 1) ∗ b): the feature matrix after the enhancement nodes, where h is the activation function.
RVFL details
tribas – triangular basis function; relu – rectified linear unit.
Stacking the combined feature vectors d(x_i) row-wise gives the linear system

\begin{pmatrix} d(x_1) \\ d(x_2) \\ \vdots \\ d(x_n) \end{pmatrix} W = Y, \quad \text{i.e. } D W = Y
Learning in RVFL
• Once the random parameters β and b are generated in a proper range, learning in RVFL amounts to solving y_i = d_i ∗ W, i = 1, 2, ..., n, where n is the number of training samples, W are the weights in the output layer, and y, d represent the target and the combined features.
• Objective: min_W ||DW − Y||² + λ||W||²
• Solutions:
– In the primal space: W = (λI + D'D)⁻¹ D'Y
– In the dual space: W = D'(λI + DD')⁻¹ Y
where λ is the regularization parameter; as λ → 0, the method converges to the Moore–Penrose pseudoinverse solution. D and Y are the stacked versions of the features and targets.
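The whole training procedure fits in a few lines (a NumPy sketch under the notation above; tanh, the uniform (-1, 1) range, and the toy data are illustrative choices, not the authors' released code):

```python
import numpy as np

def rvfl_train(X, Y, k=100, lam=1e-3, scale=1.0, rng=None):
    """Random enhancement layer + direct links + ridge-regression output."""
    rng = rng or np.random.default_rng(0)
    n, m = X.shape
    beta = rng.uniform(-scale, scale, size=(m, k))  # random input weights (fixed)
    b = rng.uniform(-scale, scale, size=k)          # random hidden biases (fixed)
    D = np.hstack([X, np.tanh(X @ beta + b)])       # direct link: D = [X, H]
    # primal solution W = (lam*I + D'D)^(-1) D'Y; use the dual form when k+m > n
    W = np.linalg.solve(lam * np.eye(D.shape[1]) + D.T @ D, D.T @ Y)
    return beta, b, W

def rvfl_predict(X, beta, b, W):
    return np.hstack([X, np.tanh(X @ beta + b)]) @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = (X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=200)).reshape(-1, 1)
beta, b, W = rvfl_train(X, Y, rng=rng)
mse = np.mean((rvfl_predict(X, beta, b, W) - Y) ** 2)
print(mse < 1e-3)  # the direct links alone can fit this linear target
```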
Comprehensive Evaluation of RVFL
69
L. Zhang, P. N. Suganthan, "A Survey of Randomized Algorithms for
Training Neural Networks," Information Sciences, DoI: 10.1016/j.ins.2016.01.039,
Volumes 364–365, pp.146–155, Oct, 2016
Evaluation Protocol
• Follow the exact pipeline of the JMLR paper.
• The Friedman rank of each method is used to evaluate the classifiers.
L. Zhang, P. N. Suganthan, "A Comprehensive Evaluation of Random Vector Functional Link Networks," Information Sciences, DOI:10.1016/j.ins.2015.09.025, Volumes 367–368, pp. 1094–1105, Nov 2016. (Codes Available: 2016-RVFL-Comp-Eval-Classification)
Results: direct link and bias
The direct links lead to better performance than configurations without them in all cases. The bias term in the output neurons has only mixed effects on the performance: it may or may not improve it.
Results: Activation Function
The radbas function always leads to better performance; the hardlim and sign activation functions lead to the second-worst and the worst performances, respectively.
Results: Moore–Penrose Pseudoinverse vs Ridge Regression
The ridge regression based RVFL shows better performance than the Moore–Penrose pseudoinverse based RVFL.
Results: Scaling the randomization range of input weights and biases
Performance of RVFL for different randomization ranges. A smaller rank indicates better accuracy and fewer hidden neurons. N stands for the number of hidden neurons corresponding to the test accuracy used in the ranking; in other words, for each ranking, performance is maximized and the corresponding number of neurons recorded.
Results: Scaling the randomization range of input weights and biases
• Scaling down the randomization range of the input weights and biases to avoid saturating the neurons risks degrading the discrimination power of the random features. However, this can be compensated by having more hidden neurons or the direct link.
• Scaling up the randomization range to enhance the discrimination power of the random features risks saturating the neurons. Again, this can be compensated by having more hidden neurons or by combining with the direct link from the input to the output layer. However, for reasons explained in Section 2.5, we prefer lower model complexity, i.e. fewer hidden neurons.
Ridge Regression
Linear regression with ℓ2 regularization:

\min_w \; \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda \|w\|^2

Its solution: w = (X^T X + \lambda I)^{-1} X^T y.
Consider a feature map φ(·) that maps the original features to a higher dimension to enhance the discriminability, i.e. the kernel trick: replace x_i by φ(x_i) above.
Kernel Ridge Regression
According to the representer theorem, the solution can be expressed as a linear combination of the samples: w = \sum_i \alpha_i φ(x_i).
Objective with the kernel trick: \min_\alpha \|y - K\alpha\|^2 + \lambda\, \alpha^T K \alpha, where K_{ij} = φ(x_i)^T φ(x_j).
The solution is: \alpha = (K + \lambda I)^{-1} y.
Output: f(x) = \sum_i \alpha_i k(x_i, x).
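A compact sketch of the dual solution with an RBF kernel (the kernel choice, hyperparameters, and toy data are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma):
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

def krr_fit(X, y, lam, gamma):
    """alpha = (K + lam*I)^(-1) y, from the representer-theorem solution."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_test, gamma):
    """f(x) = sum_i alpha_i k(x_i, x)."""
    return rbf_kernel(X_test, X_train, gamma) @ alpha

X = np.linspace(0, 1, 50).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel()
alpha = krr_fit(X, y, lam=1e-4, gamma=50.0)
err = np.max(np.abs(krr_predict(X, alpha, X, gamma=50.0) - y))
print(err)  # small: the smooth target is fit almost exactly
```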
KRR for Classification
• Through the commonly used 0-1 coding of class labels.
C. Saunders, A. Gammerman and V. Vovk, "Ridge Regression Learning Algorithm in Dual Variables," in Proc. ICML 1998.
S. An, W. Liu, and S. Venkatesh, "Face recognition using kernel ridge regression," in Proc. IEEE CVPR 2007, IEEE, Piscataway, N.J., pp. 1-7, 2007.
Overall performance compared with the JMLR results
[Tables: top 10 classifiers and top 11–20 classifiers, with our proposed method highlighted.]
Reference: Fernández-Delgado, Manuel, et al. "Do we need hundreds of classifiers to
solve real world classification problems?." The Journal of Machine Learning
Research 15.1 (2014): 3133-3181.
References
• Zhang, Le, and Ponnuthurai Nagaratnam Suganthan. "Random forests with ensemble of feature spaces." Pattern Recognition 47.10 (2014): 3429-3437.
• Zhang, Le, and Ponnuthurai N. Suganthan. "Oblique decision tree ensemble via multisurface proximal support vector machine." Cybernetics, IEEE Transactions on 45.10 (2015): 2165-2176.
• Zhang, Le, Ren, Ye, and P. N. Suganthan. "Ensemble Classification and Regression-Recent Developments, Applications and Future Directions [Review Article]." Computational Intelligence Magazine, IEEE 11.1 (2016): 41-53.
• Zhang, Le, and P. N. Suganthan. "A comprehensive evaluation of random vector functional link networks." Information Sciences (2015).
• Zhang, Le, and P. N. Suganthan. "A Survey of Randomized Algorithms for Training Neural Networks." Information Sciences (2016).
Time Series Forecasting in Renewable Energy Systems
Outline
• Introduction
– Time Series
– Computational Intelligence
– Ensemble Methods
• Empirical Mode Decomposition (EMD)
– EMD-SVR
– Ensemble EMD
– Application: Wind speed forecasting
• Random Vector Functional Link (RVFL) Network
– RVFL structural configurations
– Application: Load demand forecasting
– Application: Wind power ramp classification
Introduction – Time Series
A time series is a sequence of data points such that:
1. It consists of successive measurements made over a time interval.
2. The time interval is continuous.
3. The spacing between any two consecutive data points is the same.
4. Each time unit in the interval has at most one data point.
5. It may be univariate or multivariate.
6. The task is classification or forecasting.
Introduction – Applications
[Images: finance and energy applications]
Time Series Forecasting
• Historical value => future value
• Current observations => future value
• Hybrid of the above two
• Statistical approach
• Physical approach based on differential eqns.
• Computational intelligence based approach
• Hybrid/ensemble approach
Y. Ren, L. Zhang, and P. N. Suganthan, “Ensemble Classification and Regression –
Recent Developments, Applications and Future Directions,” IEEE Comput. Intell.
Mag., 2016, doi: 10.1109/MCI.2015.2471235
Time Series Forecasting (Cont’d)
• Wind speed/power forecasting
• Electricity Load demand forecasting
• Electricity Price forecasting
• Solar irradiance/power forecasting
• Load/wind/solar power ramp forecasting, etc.
Outline
• Introduction
– Time Series
– Computational Intelligence
– Ensemble Methods
• Empirical Mode Decomposition (EMD)
– EMD-SVR
– Ensemble EMD
– Application: Wind speed forecasting
• Random Vector Functional Link (RVFL) Network
– RVFL structural configurations
– Application: Load demand forecasting
– Application: Wind power ramp classification
Empirical Mode Decomposition (EMD)
Finally, the series is expressed as the sum of the IMFs and a residue:

x(t) = \sum_{i=1}^{N} c_i(t) + r_N(t)

IMF: Intrinsic Mode Functions
N. Huang, Z. Shen, S. Long, M. Wu, H. Shih, Q. Zheng, N. Yen, C. Tung, and H. Liu, "The empirical mode decomposition and Hilbert spectrum for nonlinear and nonstationary time series analysis," Proc. Royal Society London A, vol. 454, pp. 903–995, 1998.
Empirical Mode Decomposition (EMD) (cont'd)
Empirical Mode Decomposition (EMD)
(cont’d)
• Adaptive, Local, Orthogonal, Completeness
• Decompose complex time series into simpler time series (narrow band, symmetric)
• Reveal hidden features/correlations of the time series
• Mode mixing problem:
– a sub-series consists of signal spanning a wide band of frequencies, or
– more than one sub-series contains signals in a similar frequency band
• Ensemble of EMD => solves the problem
G. Rilling, P. Flandrin, and P. Gonçalvès, "On empirical mode decomposition and its algorithms," in Proc. IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing (NSIP'03), Grado, Italy, 2003, pp. 8–11.
Ensemble of EMD
• Ensemble EMD (EEMD):
– Add uncorrelated Gaussian noise to the original time series.
– Repeat EMD on the noise-added series.
– Combine the results: the noise realizations cancel each other out.
– But completeness is violated.
Z. Wu and N. E. Huang, "Ensemble empirical mode decomposition: a noise-assisted data analysis method," Advances in Adaptive Data Analysis, vol. 1, no. 1, pp. 1–41, 2009.
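The EEMD loop itself is tiny; here is a hedged sketch in which `emd` is any user-supplied decomposition routine assumed to return a fixed number of sub-series (a real sifting-based EMD is outside the scope of this sketch, so a toy stand-in is used below):

```python
import numpy as np

def eemd(x, emd, n_trials=100, noise_std=0.2, rng=None):
    """Ensemble EMD: decompose many noise-added copies and average, so the
    added Gaussian noise cancels out in the mean."""
    rng = rng or np.random.default_rng(0)
    sigma = noise_std * np.std(x)
    imfs = [emd(x + rng.normal(0.0, sigma, size=len(x))) for _ in range(n_trials)]
    return np.mean(imfs, axis=0)

# toy stand-in for EMD: split into a detrended part and a moving-average trend
def toy_emd(x, w=9):
    trend = np.convolve(x, np.ones(w) / w, mode="same")
    return np.stack([x - trend, trend])

t = np.linspace(0, 1, 256)
x = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 2 * t)
subs = eemd(x, toy_emd, n_trials=200)
print(subs.shape)  # (2, 256): averaged sub-series
```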
Ensemble of EMD (cont'd)
• Complementary EEMD (CEEMD):
– Gaussian noise is added in complementary pairs.
– Completeness is retained.
– Needs more trials (ensembles).
J.-R. Yeh, J.-S. Shieh, and N. E. Huang, "Complementary ensemble empirical mode decomposition: a novel noise enhanced data analysis method," Advances in Adaptive Data Analysis, vol. 2, no. 2, pp. 135–156, 2010.
Ensemble of EMD (cont'd)
• Complete EEMD with Adaptive Noise (CEEMDAN):
– Adaptive noise.
– Sequential process.
– Reduces the number of trials (ensembles).
– But cannot be parallelized.
M. Torres, M. Colominas, G. Schlotthauer, and P. Flandrin, "A complete ensemble empirical mode decomposition with adaptive noise," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'11), Prague, Czech Republic, 22-27 May 2011.
Study on EMD-SVR/ANN for wind speed forecasting
• Forecasting models: EMD-ANN, EEMD-ANN, CEEMD-ANN, CEEMDAN-ANN; EMD-SVR, EEMD-SVR, CEEMD-SVR, and CEEMDAN-SVR.
• 12 wind speed time series datasets obtained from the National Data Buoy Center (NDBC).
• 70% for training and 30% for testing.
• Forecasting horizons: 1, 3 and 5 hours ahead.
• Data scaled to the (0, 1] interval.
• Compared on RMSE/MASE.
• CEEMDAN: complete ensemble empirical mode decomposition with adaptive noise.
Y. Ren, P. N. Suganthan, and N. Srikanth, "A comparative study of empirical mode decomposition-based short-term wind speed forecasting methods," IEEE Trans. Sustain. Energy, vol. 6, no. 1, pp. 236--244, Jan. 2015.
Study on EMD-SVR/ANN for wind speed forecasting (cont'd)
[Tables: RMSE; Friedman test of EMD-based hybrid methods for 1, 3 and 5 hour ahead wind speed forecasting; CPU time (s) of EMD, EEMD, CEEMD and CEEMDAN on the wind speed TS.]
Concluding remarks
• EMD based hybrid SVR methods outperformed the persistence method.
• EMD based hybrid ANN methods outperformed the persistence method for 1 and 3 hour ahead forecasting only.
• EMD based hybrid SVR methods outperformed the SVR for 1, 3 and 5 hour ahead forecasting.
• EMD-ANN method had significantly worse performance than the ANN.
• EEMD/CEEMD/CEEMDAN-ANN methods had comparable performance as the ANN.
• CEEMDAN-SVR and EEMD-SVR outperformed the CEEMD-SVR and EMD-SVR for 1, 3 and 5 hour ahead forecasting.
• CEEMDAN-ANN, the EEMD-ANN and the CEEMD-ANN outperformed the EMD-ANN.
• In general, the EMD-based hybrid SVR methods performed better than the EMD-based hybrid ANN methods, although SVR and ANN alone had similar performance => EMD and its improved versions enhance the performance of SVR on time series forecasting.
• Considering the CPU time and the number of decomposed sub-series, CEEMDAN-SVR is the best-performing method for wind speed time series forecasting.
Y. Ren, P. N. Suganthan, and N. Srikanth, “A comparative study of empirical mode
decomposition-based short-term wind speed forecasting methods,” IEEE Trans.
Sustain. Energy, vol. 6, no. 1, pp. 236--244, Jan. 2015.
Outline
• Introduction
  – Time Series
  – Computational Intelligence
  – Ensemble Methods
• Empirical Mode Decomposition (EMD)
  – EMD-SVR
  – Ensemble EMD
  – Application: Wind speed forecasting
• Random Vector Functional Link (RVFL) Network
  – RVFL structural configurations
  – Application: Load demand forecasting
  – Application: Wind power ramp classification
Random vector functional link (RVFL)
Network
Y.-H. Pao, G.-H. Park, and D. J. Sobajic, “Learning and generalization characteristics of the random vector functional-link net,” Neurocomputing, vol. 6, no. 2, pp. 163–180, 1994.
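The RVFL idea can be sketched in a few lines of numpy: hidden (enhancement) weights and biases are random and fixed, the direct input-output links are concatenated with the hidden outputs, and only the output weights are solved in closed form. The tanh activation, ridge regularizer, and function names here are illustrative assumptions, not the exact configuration used on these slides:

```python
import numpy as np

def rvfl_train(X, Y, n_hidden=30, lam=1e-4, seed=0):
    """Fit an RVFL net: random fixed hidden weights/biases,
    closed-form (ridge) solution for the output weights only."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, (X.shape[1], n_hidden))  # random input->hidden weights
    b = rng.uniform(-1, 1, n_hidden)                # random hidden-layer bias
    H = np.tanh(X @ W + b)                          # enhancement nodes
    D = np.hstack([X, H])                           # direct links + enhancement nodes
    # ridge-regularized least squares for the output weights
    beta = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ Y)
    return W, b, beta

def rvfl_predict(X, W, b, beta):
    return np.hstack([X, np.tanh(X @ W + b)]) @ beta
```

Because training reduces to one linear solve, RVFL is non-iterative, which is the source of the speed advantage reported later in these slides.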
RVFL variations
• Effects of:
  – Input layer bias
  – Hidden layer bias
  – Direct input–output connections
Y. Ren, P. N. Suganthan, N. Srikanth and G. Amaratunga, “Single Hidden Layer Neural Networks with Random Weights for
Short-term Electricity Load Demand Forecasting,” Information Sciences, 2016.
W. F. Schmidt, M. A. Kraaijveld, and R. P. W. Duin, “Feedforward neural networks with random weights,” in Proc. 11th IAPR International Conference on Pattern Recognition, Vol. II (Conference B: Pattern Recognition Methodology and Systems), IEEE, 1992, pp. 1–4.
RVFL variations (cont’d)
Diagram: input layer bias, hidden layer bias, direct input–output link
RVFL for load demand forecasting
• Input layer bias and hidden layer bias did not significantly affect the performance, whereas the direct input–output connections significantly improved it.
• The quantile scaling algorithm improved the performance for 1–4 hour and 18–24 hour ahead forecasting horizons.
• Feature selection based on the partial autocorrelation function or seasonal autoregression consistently degraded the performance on the seasonal time series.
Random Forest for TS forecasting
• Same concept as classification, but:
  – Classification tree => regression tree
  – Majority vote => averaging or selecting the median for the final output
• Important parameters:
  – n_tree: number of bootstrap samples (trees)
  – m_try: number of variables tried at each split
• Criteria at a split:
  – Minimize the residual sum of squares
  – Or minimize the within-group variance
  – Or maximize the between-group variance
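The split criterion above can be illustrated with a brute-force search for the threshold on one variable that minimizes the residual sum of squares (equivalently, the size-weighted within-group variance); a sketch, not an optimized implementation:

```python
import numpy as np

def best_split(x, y):
    """Return the threshold on x minimizing the residual sum of
    squares (RSS) of the two resulting groups of y."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    best_t, best_rss = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:            # cannot split between equal values
            continue
        left, right = y[:i], y[i:]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_t, best_rss = (x[i] + x[i - 1]) / 2, rss
    return best_t, best_rss
```

In a random forest, this search is repeated over m_try randomly chosen variables at every node of each of the n_tree trees grown on bootstrap samples.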
RVFL for load demand forecasting (cont’d)
Tables: performance, training time and testing time
RVFL for load demand forecasting (cont’d)
• No clear overall advantage of input layer and hidden layer biases. However, input layer biases are necessary for the network to function as a universal approximator => we recommend retaining the biases as configuration choices, as they may be beneficial for some forecasting problems.
• Compared with reported non-ensemble forecasting methods such as the persistence method, seasonal ARIMA and artificial neural networks, the RVFL network has significantly better performance.
• The RVFL network is, however, outperformed by the random forest, which is an ensemble method.
• The computation time (training plus testing) of the RVFL network is the shortest among the compared methods.
Wind Power Ramp Forecasting with
RVFL
Two ramp definitions: based on (i) local extrema and (ii) end points
Y. Ren, X. Qiu, P. N. Suganthan, N. Srikanth, G. Amaratunga, “Detecting Wind Power
Ramp with Random Vector Functional Link (RVFL) Network,” IEEE Symposium on
Computational Intelligence and Ensemble Learning (IEEE CIEL'15), Dec. 2015.
Wind Power Ramp Forecasting with
RVFL (cont’d)
• Noise caused by:
  – Wind gusts
  – Wind turbine maintenance / shutdown
  – Sensor noise or faults
• De-noising:
  – Outlier smoothing: median absolute deviation (MAD) rejection
  – Missing data: extrapolation
  – White noise: empirical mode decomposition (EMD)
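The MAD rejection step above can be sketched as follows; the 3-MAD threshold and the Gaussian-consistency constant are common conventions assumed here, not values specified on the slides:

```python
import numpy as np

def mad_reject(series, k=3.0):
    """Flag points more than k scaled MADs from the median as outliers.
    0.6745 rescales the MAD to be consistent with a Gaussian std."""
    s = np.asarray(series, dtype=float)
    med = np.median(s)
    mad = np.median(np.abs(s - med))
    z = 0.6745 * np.abs(s - med) / mad
    return z > k  # True where the point is rejected as an outlier
```

Flagged points would then be replaced by interpolated values before the EMD step.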
Wind Power Ramp Forecasting with RVFL (cont’d)
• Imbalanced data:
  – Wind ramp forecasting => classification problem
  – Ramp: minority class; no-ramp: majority class => imbalanced dataset
  – Over-sampling or under-sampling to correct the imbalance
• Error measure:
Experiment setup
1. Wind power time series
2. De-noise
3. Convert to power ramps (+1: significant ramp, -1: no significant ramp)
4. Over-sample to avoid imbalance
5. RVFL classification
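The over-sampling step in the setup above can be sketched as simple random duplication of the minority (ramp) class; random duplication with replacement is one common scheme, assumed here since the slides do not specify one:

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class samples (with replacement)
    until both classes are equally represented."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])
```

Balancing the classes this way keeps the classifier from trivially predicting "no ramp" everywhere.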
Results and Discussion
A fraction of the wind power generated in an ELIA wind farm; red segments denote power ramps
Summary of the Datasets for Wind Ramp Classification
Results and Discussion (cont’d)
Friedman rank sum test on the performance measures of the four classification methods
Nemenyi post-hoc test on the performance measures of the four classification methods with 12 hour window forecasting
Concluding remarks
• Wind power ramp forecasting with the RVFL network
• Two wind power ramp definitions: based on (i) local extrema and (ii) end points
• The RVFL network has comparable performance to ANN, RF and SVR for 6 hour ahead forecasting
• The RVFL network has better performance than the SVM for 12 hour ahead forecasting
• The RVFL network has a significant time advantage over SVM and ANN, and is comparable to RF
Conclusions
• Empirical mode decomposition based SVR improves time series forecasting accuracy
• Ensembles of EMD: CEEMDAN-SVR has the best performance and a short computation time
• The RVFL network with input bias, without hidden bias, and with direct input–output connections is the best configuration
• The RVFL network has significantly better performance than non-ensemble methods for wind speed TS forecasting
• The RVFL network has better performance and a shorter computation time for wind power ramp forecasting
• RF is also highly competitive.
Future work
• Ensemble of EMD with multi-variate time series
• Ensemble of EMD with complex valued time series
• Ensemble of RVFL network
• Investigate in detail the feature space after the enhancement node transformation
• Deep Learning ensembles
• Transfer Learning
Contributed by: Zhang Le (PhD 2016)
Ren Ye (PhD 2016)
Thank You !