Ensemble Methods (Machine Learning, SIR)
Transcript of the lecture slides.

Matthieu Geist (CentraleSupelec), [email protected]
Outline
1. Introduction: Decision trees; Overview
2. Bagging: Bootstrap aggregating; Random forests; Extremely randomized trees
3. Boosting: AdaBoost; Derivation and partial analysis; Restricted functional gradient descent
- (Base) learner: predictions are made by an estimator built with a given learning algorithm.
- Ensemble methods: improve generalization and robustness by combining the predictions of several base learners.
Figure: An illustration of bagging.
Figure: An illustration of boosting.
Introduction

Decision trees
Basic idea
A regression tree partitions the input space into regions $R_1, \dots, R_5$ and predicts a constant $c_m$ on each of them:

$f(x) = \sum_{m=1}^{5} c_m \mathbb{I}\{x \in R_m\}.$

Figure: Illustration of a regression tree.
Building a regression tree
Assume the partition $(R_1, \dots, R_M)$ is fixed; the regression model associates a constant to each leaf:

$f(x) = \sum_{m=1}^{M} c_m \mathbb{I}\{x \in R_m\}.$
Minimize the empirical risk based on the $\ell_2$-loss:

$\min_{c_1, \dots, c_M} R_n(f) = \min_{c_1, \dots, c_M} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2.$
The solution is obtained by setting the gradient to zero:

$c_m = \operatorname{ave}(y_i \mid x_i \in R_m) = \frac{\sum_{i=1}^{n} y_i \mathbb{I}\{x_i \in R_m\}}{\sum_{i=1}^{n} \mathbb{I}\{x_i \in R_m\}}.$
Building a regression tree (cont.)
How to find the best binary partition in terms of $R_n$?
- In general, this is computationally infeasible.
- Idea: use a (heuristic) greedy approach.
Let $D$ be the dataset:

$D = \{(x_i, y_i)\}_{1 \le i \le n}$ with $x_i = (x_{i,1}, \dots, x_{i,d})^\top \in \mathbb{R}^d.$
For a splitting variable $j$ and a split point $s$, define the pair of half-planes

$R_1(j, s) = \{x_i \in D : x_{i,j} \le s\}$ and $R_2(j, s) = \{x_i \in D : x_{i,j} > s\}.$
Then, search for $(j, s)$ solving

$\min_{j, s} \left( \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right).$
Building a regression tree (cont.)
There is a finite number of $(j, s)$ couples:
- $d$ splitting variables: $1 \le j \le d$;
- $n - 1$ split points for each splitting variable $j$: obtained by ordering the $j$th components.
The inner minimization problem is easily solved for any $(j, s)$:

$c_1 = \operatorname{ave}(y_i \mid x_i \in R_1(j, s))$ and $c_2 = \operatorname{ave}(y_i \mid x_i \in R_2(j, s)).$
Therefore, by scanning through each dimension of all the inputs, the best pair $(j, s)$ can be determined at a reasonable cost.
Then, having found the best split, we partition the data into the two resulting regions and repeat the splitting procedure on each of them.
And so on, until a stopping criterion is reached. A sketch of the split search is given below.
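To make the greedy search concrete, here is a minimal sketch in Python (NumPy only; the function name and its interface are illustrative, not from the lecture). It scans every dimension and every candidate split point, and returns the pair $(j, s)$ minimizing the $\ell_2$ criterion above.

```python
import numpy as np

def best_split(X, y):
    """Exhaustive greedy search for the (j, s) couple minimizing
    the sum of squared errors over the two half-planes."""
    n, d = X.shape
    best_j, best_s, best_score = None, None, np.inf
    for j in range(d):
        # Candidate split points: midpoints between sorted j-th components.
        values = np.sort(np.unique(X[:, j]))
        for s in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            # Inner minimization: the optimal constant in a region is the mean.
            score = ((left - left.mean()) ** 2).sum() \
                  + ((right - right.mean()) ** 2).sum()
            if score < best_score:
                best_j, best_s, best_score = j, s, score
    return best_j, best_s, best_score
```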
Building a regression tree (cont.)
Possible stopping criteria:
- maximum depth;
- maximum number of leaves;
- maximum number of samples per leaf;
- etc.
The stopping criterion is important:
- a very large tree will overfit the data (consider a tree with as many leaves as samples);
- a too small tree might not capture the important structure (e.g., a decision stump, a tree with only two terminal nodes).
Building a classification tree
Classification tree: $y \in \{1, \dots, K\}$; the $\ell_2$-loss is not the best choice, so we should consider another criterion for splitting.
For a node $m$ representing a region $R_m$, consider:
- the number of data points $n_m = \sum_{i=1}^{n} \mathbb{I}\{x_i \in R_m\}$;
- the associated dataset $D_m = \{(x_i, y_i) \in D : x_i \in R_m\}$;
- the proportion of class-$k$ observations in node $m$, $p_{m,k} = \frac{1}{n_m} \sum_{x_i \in R_m} \mathbb{I}\{y_i = k\}$;
- the majority class $k(m) = \operatorname{argmax}_{1 \le k \le K} p_{m,k}$.
In regression, partitioning was based on the node impurity

$Q(D_m) = \frac{1}{n_m} \sum_{x_i \in R_m} (y_i - c_m)^2;$

what node impurity should one choose for classification?
Building a classification tree (cont.)
- Misclassification error:
  $Q(D_m) = \frac{1}{n_m} \sum_{x_i \in R_m} \mathbb{I}\{y_i \ne k(m)\} = 1 - p_{m,k(m)};$
- Gini index:
  $Q(D_m) = \sum_{k \ne k'} p_{m,k} \, p_{m,k'} = \sum_{k=1}^{K} p_{m,k} (1 - p_{m,k});$
- Cross-entropy:
  $Q(D_m) = -\sum_{k=1}^{K} p_{m,k} \ln p_{m,k}.$
Figure: Measures of node impurity (misclassification error, Gini index, cross-entropy) for binary classification, as a function of the proportion p in the second class.
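As a quick sanity check of these formulas, here is a minimal sketch (function names are illustrative) computing the three impurities from the vector of class proportions of a node; in the binary case, evaluating them on $(1-p, p)$ for $p \in [0, 1]$ reproduces the curves of the figure above.

```python
import numpy as np

def misclassification(p):
    """p: vector of class proportions (p_{m,1}, ..., p_{m,K}) of a node."""
    return 1.0 - p.max()

def gini(p):
    return float((p * (1.0 - p)).sum())

def cross_entropy(p):
    q = p[p > 0]  # convention: 0 * ln(0) = 0
    return float(-(q * np.log(q)).sum())

# Binary node with proportion 0.3 in the second class:
p = np.array([0.7, 0.3])
print(misclassification(p), gini(p), cross_entropy(p))
```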
Building a classification tree (cont.)
The tree is grown as previously. For a region $R_m$:
- consider a couple $(j, s)$ of splitting variable and split point;
- write $D_{m,L}(j, s)$ for the resulting dataset of the left node (of size $n_{m_L}$);
- write $D_{m,R}(j, s)$ for the dataset of the right node (of size $n_{m_R}$).
The tree is grown by solving

$\min_{j, s} \left( n_{m_L} Q(D_{m,L}(j, s)) + n_{m_R} Q(D_{m,R}(j, s)) \right).$
Summary
Trees are interpretable predictors, quite easy to train.
Other advantages:
- they can handle categorical values;
- they can be extended to cost-sensitive learning;
- they can handle missing input values.
However, trees are unstable:
- they have a high variance: a small change in the data can result in a very different tree;
- this is mainly caused by the hierarchical nature of the process (the effect of an error is propagated down);
- the smaller the tree, the more stable it is, but the weaker the learner.
Overview

Bagging (or more generally averaging methods):
- build several estimators in parallel, more or less independently, and average their predictions;
- this reduces the variance of the combined estimator;
- applicable for example to deep trees.

Boosting:
- build (weak) base estimators sequentially, so as to reduce the bias of the combined estimator;
- motivation: combine weak learners to form a strong learner;
- applicable for example to decision stumps.

There exist other ensemble methods (Bayesian model averaging, mixture of experts or stacking, for example), not addressed here.
Bagging
Motivation
Abstract problem:
- let $X_1, \dots, X_B$ be i.i.d. random variables, of mean $\mu = E[X_1]$ and of variance $\sigma^2 = \operatorname{var}(X_1) = E[(X_1 - \mu)^2]$;
- consider the empirical mean $\mu_B = \frac{1}{B} \sum_{b=1}^{B} X_b$;
- the expectation does not change, $E[\mu_B] = \mu$, while the variance is reduced, $\operatorname{var}(\mu_B) = \frac{\sigma^2}{B}$.
Supervised learning:
- the random quantity is the dataset $D = \{(x_i, y_i)\}_{1 \le i \le n}$;
- from this, an estimate $f_D$ is computed by minimizing the empirical risk of interest.

Ideal case:
- let $D_1, \dots, D_B$ be datasets drawn independently, and write $f_b = f_{D_b}$ the associated minimizer;
- define $f_{\mathrm{ave}} = \frac{1}{B} \sum_{b=1}^{B} f_b$;
- this does not change the expectation, $E[f_{\mathrm{ave}}] = E[f_1]$, but it reduces the variance, $\operatorname{var}(f_{\mathrm{ave}}) = \frac{1}{B} \operatorname{var}(f_1)$.

Unfortunately, it is not possible to sample datasets on demand...
Bootstrap aggregating
bagging = bootstrap + aggregating
bootstrap = random sampling with replacement (statistics)
Figure: Illustration of the bootstrapping principle.
Bagging:
- bootstrap the dataset to get $B$ datasets $D_b$ (same size as $D$), $1 \le b \le B$;
- learn a predictor $f_{D_b} = f_b$ for each of these datasets;
- average the predictors:

$f_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x).$
This applies directly to regression. For classification, averaging the predicted classes would not make sense. Possible strategies:
- do a majority vote over trees:

$f_{\mathrm{bag}}(x) = \operatorname{argmax}_{1 \le k \le K} \left( \frac{1}{B} \sum_{b=1}^{B} \mathbb{I}\{f_b(x) = k\} \right);$

- consider the class proportions of the leaf corresponding to the input of interest ($p_{m,k}$, with $x \in R_m$) for each tree, average them over all trees, and output the class that maximizes this averaged class proportion.
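A minimal from-scratch sketch of bagging for regression (a sketch under the assumption that scikit-learn trees are used as base learners; function names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def bagging_fit(X, y, B=100):
    """Fit B deep trees, each on a bootstrap replicate of (X, y)."""
    n = len(X)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # sampling with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Average the predictions of the ensemble (regression case)."""
    return np.mean([m.predict(X) for m in models], axis=0)
```

For classification, it would suffice to replace the averaging by one of the two strategies above (majority vote or averaged class proportions).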
Variations of the bagging approach
- Bagging: draw random subsets of the samples, with replacement.
- Pasting: alternatively, random subsets of the samples can be drawn without replacement.
- Random subspaces: one can also randomly select a subset of the components of the inputs (which are generally multi-dimensional) to learn the models.
- Random patches: combination of the two previous methods (random subsets of both samples and input components).
Random forests
Motivation
There is necessarily some overlap between bootstrapped datasets, so the models corresponding to each of these datasets are correlated.

Abstract problem:
- let $X_1, \dots, X_B$ be identically distributed but not independent random variables, of mean $\mu = E[X_1]$ and of variance $\sigma^2 = \operatorname{var}(X_1) = E[(X_1 - \mu)^2]$;
- consider the empirical mean $\mu_B = \frac{1}{B} \sum_{b=1}^{B} X_b$;
- assume a positive pairwise correlation $\rho$: for $i \ne j$,

$\rho = \frac{E[(X_i - \mu)(X_j - \mu)]}{\sqrt{\operatorname{var}(X_i) \operatorname{var}(X_j)}};$

- the expectation does not change, $E[\mu_B] = \mu$, but the variance is now given by

$\operatorname{var}(\mu_B) = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2.$

As $B$ grows, the second term vanishes but the first one remains: the benefit of averaging is limited by the correlation $\rho$.
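A quick numerical check of this formula, simulating equicorrelated Gaussian variables (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
B, rho, sigma2, trials = 10, 0.3, 1.0, 200_000

# Equicorrelated covariance: sigma2 on the diagonal, rho * sigma2 elsewhere.
cov = sigma2 * ((1 - rho) * np.eye(B) + rho * np.ones((B, B)))
X = rng.multivariate_normal(np.zeros(B), cov, size=trials)

print(X.mean(axis=1).var())                   # empirical variance of mu_B
print(rho * sigma2 + (1 - rho) / B * sigma2)  # theoretical value: 0.37
```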
Random forests: reduce the correlation between the trees by randomizing their construction, without increasing the variance too much.
Algorithm 1: Random Forest
Require: a dataset $D = \{(x_i, y_i)\}_{1 \le i \le n}$, the size $B$ of the ensemble, the number $m$ of candidate variables for splitting.
for b = 1 to B do
    Draw a bootstrap dataset $D_b$ of size $n$ from the original training set $D$.
    Grow a random tree using the bootstrapped dataset:
    repeat
        for all terminal nodes do
            Select $m$ variables among $d$, at random.
            Pick the best variable and split-point couple among the $m$.
            Split the node into two daughter nodes.
        end for
    until the stopping criterion is met (e.g., minimum number of samples per node reached)
end for
return the ensemble of B trees.
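In practice this corresponds, for example, to scikit-learn's random forests, where the number $m$ of candidate variables is exposed as max_features (a hedged usage sketch; the hyperparameter values are illustrative):

```python
from sklearn.ensemble import RandomForestRegressor

# B = n_estimators bootstrapped trees; m = max_features candidates per split.
forest = RandomForestRegressor(n_estimators=100, max_features="sqrt")
# forest.fit(X_train, y_train); forest.predict(X_test)
```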
Extremely randomized trees
Extremely randomized trees push the randomization further. The two differences with random forests are:
- split points are also chosen randomly, one for each randomly selected splitting variable;
- the full learning dataset is used for growing each tree (no bootstrap).

The rationale for:
- choosing also the split point at random is to further reduce the correlation between trees;
- using the full learning set is to achieve a lower bias (at the price of an increased variance, which should be compensated by the randomization of split points).
Algorithm 2: Extremely Randomized Forest
Require: a dataset $D = \{(x_i, y_i)\}_{1 \le i \le n}$, the size $B$ of the ensemble, the number $m$ of candidate variables for splitting.
for b = 1 to B do
    Grow a random tree using the original dataset:
    repeat
        for all terminal nodes do
            Select $m$ variables among $d$, at random.
            for all sampled variables do
                Select a split point at random.
            end for
            Pick the best variable and split-point couple among the $m$ candidates.
            Split the node into two daughter nodes.
        end for
    until the stopping criterion is met (e.g., minimum number of samples per node reached)
end for
return the ensemble of B trees.
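Again as a hedged usage sketch, extremely randomized trees are available in scikit-learn as ExtraTrees estimators (which, consistently with the algorithm above, do not bootstrap by default):

```python
from sklearn.ensemble import ExtraTreesRegressor

# Random split points; the full dataset is used for each tree (bootstrap=False).
extra = ExtraTreesRegressor(n_estimators=100, max_features="sqrt")
# extra.fit(X_train, y_train); extra.predict(X_test)
```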
Boosting
Boosting takes a different route from bagging:
- bagging: learn in parallel a set of models with low bias and high variance (learning being randomized, for example through bootstrapping), the prediction being made by averaging all these models;
- boosting: add models with high bias and low variance sequentially, so as to reduce the bias of the ensemble.

For boosting, we will focus on binary classification.
AdaBoost
Weighted binary classification
We consider binary classification:
- the available dataset is of the form $D = \{(x_i, y_i)\}_{1 \le i \le n}$ with $y_i \in \{-1, +1\}$;
- the empirical risk of interest is

$R_n(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{y_i \ne f(x_i)\} = \sum_{i=1}^{n} \frac{1}{n} \mathbb{I}\{y_i \ne f(x_i)\};$

- all samples have the same importance (each sample has a weight $\frac{1}{n}$).
Weighted binary classification:
- we want to associate a different weight $w_i$ to each example $(x_i, y_i)$, such that $\sum_i w_i = 1$ and $w_i \ge 0$;
- the corresponding empirical risk is

$R_n(f) = \sum_{i=1}^{n} w_i \mathbb{I}\{y_i \ne f(x_i)\}.$
Weighted binary classification (cont.)
For minimizing the weighted empirical risk, one can:
- sample a bootstrap replicate according to the discrete distribution $(w_1, \dots, w_n)$ and minimize the (unweighted) empirical risk on this new dataset;
- or do it more directly, in a problem-dependent manner.

The case of classification trees: the node impurity can easily be adapted; for example, for the misclassification error, consider

$Q(D_m) = \sum_{x_i \in R_m} w_i \mathbb{I}\{y_i \ne k(m)\}.$

In the sequel, we assume that a (weak) base learner is available (typically, a decision stump), and that it is able to minimize the weighted risk using the dataset $D$ and weights $w_i$, $1 \le i \le n$.
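Many implementations expose such instance weights directly; for example, with scikit-learn trees (a minimal sketch, where X, y and w are assumed given):

```python
from sklearn.tree import DecisionTreeClassifier

# A decision stump (depth-1 tree) minimizing a weighted impurity:
stump = DecisionTreeClassifier(max_depth=1)
# stump.fit(X, y, sample_weight=w)  # w: the weights (w_1, ..., w_n)
```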
Algorithm 3: The AdaBoost (Adaptive Boosting) algorithm
Require: a dataset $D = \{(x_i, y_i)\}_{1 \le i \le n}$, the size $T$ of the ensemble.
1: Initialize the weights: $w_i^1 = \frac{1}{n}$, $1 \le i \le n$.
2: for t = 1 to T do
3:     Fit a binary classifier $f_t(x)$ to the training data using the weights $w_i^t$.
4:     Compute the error $\epsilon_t$ made by this classifier: $\epsilon_t = \sum_{i=1}^{n} w_i^t \mathbb{I}\{y_i \ne f_t(x_i)\}.$
5:     Compute the learning rate $\alpha_t$: $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right).$
6:     Update the weights, for all $1 \le i \le n$: $w_i^{t+1} = \frac{w_i^t e^{-\alpha_t y_i f_t(x_i)}}{\sum_{j=1}^{n} w_j^t e^{-\alpha_t y_j f_t(x_j)}}.$
7: end for
8: return the decision rule $H_T(x) = \operatorname{sgn}(F_T(x))$ with $F_T(x) = \sum_{t=1}^{T} \alpha_t f_t(x).$
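As a companion to Algorithm 3, here is a compact from-scratch sketch (assuming scikit-learn decision stumps as weak learners, labels in {-1, +1}, and 0 < eps_t < 1 so that alpha_t is well defined):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """AdaBoost with decision stumps; y must take values in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)  # step 1: uniform initial weights
    stumps, alphas = [], []
    for _ in range(T):
        f = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = f.predict(X)
        eps = w[pred != y].sum()               # step 4: weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)  # step 5: learning rate
        w = w * np.exp(-alpha * y * pred)      # step 6: reweight...
        w = w / w.sum()                        # ...and normalize
        stumps.append(f)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Step 8: sign of the weighted combination of the weak learners."""
    F = sum(alpha * f.predict(X) for f, alpha in zip(stumps, alphas))
    return np.sign(F)
```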
Derivation and partial analysis
Forward stagewise additive modeling
A convex surrogate for binary classification is the risk based on the exponential loss:

$R_n(F) = \frac{1}{n} \sum_{i=1}^{n} e^{-y_i F(x_i)}$ with $F \in \mathbb{R}^{\mathcal{X}}.$
We are looking for an additive model of the form

$F_T(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$ with $f_t \in \{-1, +1\}^{\mathcal{X}}$ and $\alpha_t \in \mathbb{R}.$
The corresponding optimization problem is therefore

$\min_{(\alpha_t, f_t)_{1 \le t \le T}} \frac{1}{n} \sum_{i=1}^{n} e^{-y_i \sum_{t=1}^{T} \alpha_t f_t(x_i)}.$

Yet, this optimization problem is too complicated to be solved jointly.
Forward stagewise additive modeling (cont.)
A simple alternative is to search for an approximate solution by sequentially adding basis functions and associated weights:
- define $F_0 = 0$;
- define $F_t = F_{t-1} + \alpha_t f_t$;
- solve sequentially the subproblems

$\min_{\alpha, f} \frac{1}{n} \sum_{i=1}^{n} e^{-y_i (F_{t-1}(x_i) + \alpha f(x_i))}.$
This approach:
- is reminiscent of gradient descent (with line search);
- can be straightforwardly extended to any loss function $L$:

$\min_{\alpha, f} \frac{1}{n} \sum_{i=1}^{n} L(y_i, F_{t-1}(x_i) + \alpha f(x_i)).$
Forward stagewise additive modeling (cont.)
At each iteration $t \ge 1$, we have to solve

$(\alpha_t, f_t) = \operatorname{argmin}_{\alpha, f} \frac{1}{n} \sum_{i=1}^{n} e^{-y_i (F_{t-1}(x_i) + \alpha f(x_i))}.$
The solution is given by

$f_t = \operatorname{argmin}_{f \in \mathcal{H}} \sum_{i=1}^{n} w_i^t \mathbb{I}\{y_i \ne f(x_i)\}$ with $w_i^t = \frac{e^{-y_i F_{t-1}(x_i)}}{\sum_{j=1}^{n} e^{-y_j F_{t-1}(x_j)}},$

and $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$ with $\epsilon_t = \sum_{i=1}^{n} w_i^t \mathbb{I}\{y_i \ne f_t(x_i)\}.$

This is the AdaBoost algorithm!
Bounding the empirical risk
Define the edge $\gamma_t = \frac{1}{2} - \epsilon_t$, which measures how much better the error rate of the $t$th learned classifier $f_t$ is than random guessing (error rate of $\frac{1}{2}$).

How good is the ensemble $H_T(x) = \operatorname{sgn}(F_T(x))$ compared to a single weak learner?
Theorem. Write $\gamma_t = \frac{1}{2} - \epsilon_t$ the edge of the $t$th classifier. The empirical risk of the combined classifier $H_T$ produced by AdaBoost satisfies

$\frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{y_i \ne H_T(x_i)\} \le \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} \le e^{-2 \sum_{t=1}^{T} \gamma_t^2}.$

In particular, if every weak learner has an edge of at least $\gamma > 0$, the empirical risk decays exponentially fast, as $e^{-2T\gamma^2}$.
Restricted functional gradient descent
Rationale
Consider the binary classification problem with a convex surrogate:
- we are looking for a classifier $H(x) = \operatorname{sgn}(F(x))$ with $F \in \mathbb{R}^{\mathcal{X}}$;
- let $L(y, F(x))$ be a convex surrogate of the binary loss (for example, the exponential loss $L(y, F(x)) = e^{-y F(x)}$);
- we would like to minimize the empirical risk:

$\min_{F \in \mathbb{R}^{\mathcal{X}}} R_n(F)$ with $R_n(F) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, F(x_i)).$
Convex optimization:
- as a sum of convex functions, $R_n$ is convex in $F$;
- a standard approach for minimizing a convex function is to perform a gradient descent:

$F_{t+1} = F_t - \alpha_t \nabla_F R_n(F_t).$

Problem: $F$ is a function, so what is $\nabla_F$? Is it usable?
Functional gradient and related Hilbert space
Assume that $\mathcal{X}$ is measurable and let $\mu$ be a probability measure.
Define $L_2(\mathcal{X}, \mathbb{R}, \mu)$ as the set of all equivalence classes of functions $F \in \mathbb{R}^{\mathcal{X}}$ such that the Lebesgue integral $\int_{\mathcal{X}} F(x)^2 \, d\mu(x)$ is finite.
This Hilbert space has a natural inner product: $\langle F, G \rangle_\mu = \int_{\mathcal{X}} F(x) G(x) \, d\mu(x)$.
For a functional $J : L_2(\mathcal{X}, \mathbb{R}, \mu) \to \mathbb{R}$, the Fréchet derivative is the linear operator $\nabla J(F)$ satisfying

$\lim_{G \to 0} \frac{J(F + G) - J(F) - \langle \nabla J(F), G \rangle_\mu}{\|G\|_\mu} = 0.$
Back to risk minimization
The probability measure we are interested in here is the discrete measure that assigns a probability $\frac{1}{n}$ to each input $x_i$. The associated inner product is

$\langle F, G \rangle_n = \frac{1}{n} \sum_{i=1}^{n} F(x_i) G(x_i).$

We can compute the Fréchet derivative, so we can write the gradient descent:

$F_{t+1} = F_t - \alpha_t \nabla_F R_n(F_t).$

However, the Fréchet derivative is only known at the datapoints $x_i$. It does not allow generalizing, and it is not a practical object to compute with. The idea is therefore to "restrict" this gradient to the hypothesis space $\mathcal{H}$ of interest, by looking for the function of $\mathcal{H}$ that is the most collinear to the gradient (with a comparable norm):

$f_t \in \operatorname{argmax}_{f \in \mathcal{H}} \frac{\langle \nabla_F R_n(F_t), f \rangle_n}{\|f\|_n}.$

Then, we replace the gradient by its approximation in the update rule:

$F_t = F_{t-1} - \alpha_t f_t.$
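A hedged sketch of this restricted functional gradient scheme, instantiated for regression with the squared loss rather than the exponential loss (this is the gradient-boosting flavor of the idea; names and hyperparameters are illustrative). The functional gradient of $\frac{1}{2}(y - F)^2$ at the datapoints is $F_t(x_i) - y_i$, so each base learner is fit to the residuals, and a fixed step size replaces the line search:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def functional_gradient_boosting(X, y, T=100, lr=0.1):
    """Each iteration fits a base learner to the negative functional
    gradient evaluated at the datapoints (here, the residuals y - F)."""
    F = np.zeros(len(y))  # F_0 = 0
    models = []
    for _ in range(T):
        residuals = y - F                                  # -gradient
        h = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        F = F + lr * h.predict(X)                          # update F_t
        models.append(h)
    return models
```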
Application to the exponential loss
Consider the risk based on the exponential loss:

$R_n(F) = \frac{1}{n} \sum_{i=1}^{n} e^{-y_i F(x_i)}.$
Restricting the gradient gives

$\operatorname{argmax}_{f \in \mathcal{H}} \frac{\langle \nabla R_n(F_t), f \rangle_n}{\|f\|_n} = \operatorname{argmax}_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} e^{-y_i F_t(x_i)} \mathbb{I}\{y_i \ne f(x_i)\} = -f_t.$
The optimal learning rate can be obtained by performing a line search:

$\alpha_t = \operatorname{argmin}_{\alpha > 0} R_n(F_t + \alpha f_t) = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$ with $\epsilon_t = \frac{\sum_{i=1}^{n} e^{-y_i F_t(x_i)} \mathbb{I}\{y_i \ne f_t(x_i)\}}{\sum_{i=1}^{n} e^{-y_i F_t(x_i)}}.$
This is AdaBoost!