
Ensemble Methods (Machine Learning, SIR)

Matthieu Geist (CentraleSupélec)
[email protected]


Outline

1 Introduction: Decision trees; Overview
2 Bagging: Bootstrap aggregating; Random forests; Extremely randomized trees
3 Boosting: AdaBoost; Derivation and partial analysis; Restricted functional gradient descent


(Base) learner: predictions are made based on an estimator built with a given learning algorithm.

Ensemble methods: improve generalization and robustness by combining the predictions of several base learners.

Figure: An illustration of bagging.
Figure: An illustration of boosting.

Introduction: Decision trees


Basic idea

A regression tree partitions the input space into regions R_1, ..., R_5 (five, in the illustrated example) and predicts a constant c_m on each:

f(x) = \sum_{m=1}^{5} c_m I\{x \in R_m\}.

Figure: Illustration of a regression tree.


Building a regression tree

Assume the partition (R_1, ..., R_M) is fixed; the regression model associates a constant with each leaf:

f(x) = \sum_{m=1}^{M} c_m I\{x \in R_m\}.

Minimize the empirical risk based on the \ell_2-loss:

\min_{c_1,\dots,c_M} R_n(f) = \min_{c_1,\dots,c_M} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2.

The solution is obtained by setting the gradient to zero:

c_m = \mathrm{ave}(y_i \mid x_i \in R_m) = \frac{\sum_{i=1}^{n} y_i \, I\{x_i \in R_m\}}{\sum_{i=1}^{n} I\{x_i \in R_m\}}.
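As a concrete illustration of this leaf-constant fit, here is a minimal NumPy sketch (the function names are mine, and the partition is assumed to be given as a leaf index per sample):

```python
import numpy as np

def fit_leaf_constants(leaf_idx, y):
    """Least-squares constants per leaf: c_m = mean of the y_i with x_i in R_m."""
    return {m: y[leaf_idx == m].mean() for m in np.unique(leaf_idx)}

def predict(leaf_idx, constants):
    """Piecewise-constant prediction f(x) = sum_m c_m * 1{x in R_m}."""
    return np.array([constants[m] for m in leaf_idx])

# Toy usage: 6 samples falling into 2 regions.
y = np.array([1.0, 1.2, 0.9, 3.0, 3.1, 2.9])
leaf_idx = np.array([0, 0, 0, 1, 1, 1])
c = fit_leaf_constants(leaf_idx, y)   # {0: ~1.03, 1: ~3.0}
print(predict(leaf_idx, c))
```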


Building a regression tree (cont.)

How to find the best binary partition in terms of R_n? In general, this is computationally infeasible; the idea is to use a (heuristic) greedy approach.

Let D be the dataset:

D = \{(x_i, y_i)_{1 \le i \le n}\} with x_i = (x_{i,1}, \dots, x_{i,d})^\top \in \mathbb{R}^d.

For a splitting variable j and a split point s, define the pair of half-planes

R_1(j, s) = \{x_i \in D : x_{i,j} \le s\} and R_2(j, s) = \{x_i \in D : x_{i,j} > s\}.

Then, search for (j, s) solving

\min_{j,s} \left[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right].


Building a regression tree (cont.)

There is a finite number of (j, s) couples:

- d splitting variables: 1 ≤ j ≤ d;
- n − 1 split points for each splitting variable j, obtained by ordering the j-th components.

The inner minimization problem is easily solved for any (j, s):

c_1 = \mathrm{ave}(y_i \mid x_i \in R_1(j,s)) and c_2 = \mathrm{ave}(y_i \mid x_i \in R_2(j,s)).

Therefore, by scanning through each dimension of all the inputs, determining the best pair (j, s) is feasible.

Then, having found the best split, we partition the data into the two resulting regions and repeat the splitting procedure for each of these regions, and so on, until a stopping criterion is reached.
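A minimal NumPy sketch of this exhaustive split search (illustrative only; the function name is mine, candidate split points are taken as midpoints between consecutive sorted values, and no attempt is made at efficiency):

```python
import numpy as np

def best_split(X, y):
    """Scan all (j, s) couples and return the one minimizing the summed squared
    errors of the two half-planes, each fitted with its mean."""
    n, d = X.shape
    best = (None, None, np.inf)                    # (j, s, score)
    for j in range(d):
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:   # midpoints between sorted values
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            score = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
            if score < best[2]:
                best = (j, s, score)
    return best

# Toy usage.
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))
y = np.where(X[:, 0] > 0.5, 3.0, 1.0) + 0.1 * rng.normal(size=50)
print(best_split(X, y))   # expected: j = 0, s close to 0.5
```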


Building a regression tree (cont.)

Possible stopping criteria:

- maximum depth;
- maximum number of leaves;
- minimum number of samples per leaf;
- etc.

The stopping criterion is important:

- a very large tree will overfit the data (consider a tree with as many leaves as samples);
- a too small tree might not capture the important structure (e.g., a decision stump, i.e., a tree with a single split).


Building a classification tree

In a classification tree, y ∈ {1, ..., K}; the \ell_2-loss is no longer the best choice, and we should consider another splitting criterion.

For a node m representing a region R_m, consider:

- the number of data points n_m = \sum_{i=1}^{n} I\{x_i \in R_m\};
- the associated dataset D_m = \{(x_i, y_i) \in D : x_i \in R_m\};
- the proportion of class-k observations in node m,

  p_{m,k} = \frac{1}{n_m} \sum_{x_i \in R_m} I\{y_i = k\};

- the majority class k(m) = \arg\max_{1 \le k \le K} p_{m,k}.

In regression, partitioning was based on the node impurity

Q(D_m) = \frac{1}{n_m} \sum_{x_i \in R_m} (y_i - c_m)^2;

what node impurity should one choose for classification?


Building a classification tree (cont.)

- Misclassification error,

  Q(D_m) = \frac{1}{n_m} \sum_{x_i \in R_m} I\{y_i \ne k(m)\} = 1 - p_{m,k(m)};

- Gini index,

  Q(D_m) = \sum_{k \ne k'} p_{m,k}\, p_{m,k'} = \sum_{k=1}^{K} p_{m,k}(1 - p_{m,k});

- cross-entropy,

  Q(D_m) = - \sum_{k=1}^{K} p_{m,k} \ln p_{m,k}.

Figure: Measures of node impurity (misclassification error, Gini index, cross-entropy) for binary classification, as a function of the proportion p in the second class.
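A small sketch of these three impurity measures, computed from the class proportions p_{m,k} of a node (the helper name is mine):

```python
import numpy as np

def impurities(labels, K):
    """Misclassification error, Gini index and cross-entropy of a node."""
    p = np.bincount(labels, minlength=K) / len(labels)   # class proportions p_{m,k}
    misclass = 1.0 - p.max()
    gini = np.sum(p * (1.0 - p))
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
    return misclass, gini, entropy

# Node with 6 samples of class 0 and 2 of class 1 (K = 2 classes).
print(impurities(np.array([0, 0, 0, 0, 0, 0, 1, 1]), K=2))
```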


Building a classification tree (cont.)

The tree is grown as previously. For a region R_m:

- consider a couple (j, s) of splitting variable and split point;
- write D_{m,L}(j, s) for the resulting dataset of the left node (of size n_{m_L});
- write D_{m,R}(j, s) for the dataset of the right node (of size n_{m_R}).

The tree is grown by solving

\min_{j,s} \left( n_{m_L} Q(D_{m,L}(j, s)) + n_{m_R} Q(D_{m,R}(j, s)) \right).


Summary

Trees are interpretable predictors and quite easy to train. Other advantages:

- they can handle categorical values;
- they can be extended to cost-sensitive learning;
- they can handle missing input values.

However, trees are unstable:

- they have a high variance: a small change in the data can result in a very different tree;
- this is mainly caused by the hierarchical nature of the process (the effect of an error is propagated down);
- the smaller the tree, the more stable it is, but the weaker the learner.

Introduction: Overview


Bagging (or, more generally, averaging methods):

- build several estimators in parallel, more or less independently, and average their predictions;
- this reduces the variance of the combined estimator;
- applicable, for example, to deep trees.

Boosting:

- build (weak) base estimators sequentially, so as to reduce the bias of the combined estimator;
- motivation: combine weak learners to form a strong learner;
- applicable, for example, to decision stumps.

There exist other ensemble methods (Bayesian model averaging, mixture of experts or stacking, for example), not addressed here.

Bagging


Motivation

Abstract problem:

- let X_1, ..., X_B be i.i.d. random variables, with mean \mu = E[X_1] and variance \sigma^2 = \mathrm{var}(X_1) = E[(X_1 - \mu)^2];
- consider the empirical mean \mu_B = \frac{1}{B} \sum_{b=1}^{B} X_b;
- the expectation does not change, E[\mu_B] = \mu, while the variance is reduced, \mathrm{var}(\mu_B) = \frac{1}{B}\sigma^2.

Supervised learning:

- the random quantity is the dataset D = \{(x_i, y_i)_{1 \le i \le n}\};
- from this, an estimate f_D is computed by minimizing the empirical risk of interest.

Ideal case:

- let D_1, ..., D_B be datasets drawn independently, and write f_b = f_{D_b} for the associated minimizer;
- define f_{\mathrm{ave}} = \frac{1}{B} \sum_{b=1}^{B} f_b;
- this does not change the expectation, E[f_{\mathrm{ave}}] = E[f_1], but it reduces the variance, \mathrm{var}(f_{\mathrm{ave}}) = \frac{1}{B} \mathrm{var}(f_1).

Unfortunately, it is not possible to sample datasets on demand...
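A quick Monte Carlo sanity check of the var(\mu_B) = \sigma^2 / B claim (a throwaway NumPy sketch, not from the lecture; the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
B, repeats = 10, 100_000
sigma2 = 4.0

# Draw `repeats` independent batches of B i.i.d. variables and average each batch.
X = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(repeats, B))
mu_B = X.mean(axis=1)

print(X[:, 0].var())   # ~ sigma^2       = 4.0
print(mu_B.var())      # ~ sigma^2 / B   = 0.4
```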


Bagging: Bootstrap aggregating


bagging = bootstrap + aggregating

bootstrap = random sampling with replacement (statistics)

Figure: Illustration of the bootstrapping principle.


Bagging:

- bootstrap the dataset to get B datasets D_b (same size as D), 1 ≤ b ≤ B;
- learn a predictor f_{D_b} = f_b on each of these datasets;
- average the predictors:

  f_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x).

This applies directly to regression. For classification, averaging the predicted classes would not make sense. Possible strategies:

- take a majority vote over the trees:

  f_{\mathrm{bag}}(x) = \arg\max_{1 \le k \le K} \left( \frac{1}{B} \sum_{b=1}^{B} I\{f_b(x) = k\} \right);

- for each tree, consider the class proportions of the leaf corresponding to the input of interest (p_{m,k}, with x ∈ R_m), average them over all trees, and output the class that maximizes this averaged class proportion.
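A minimal from-scratch sketch of bagged regression trees (illustrative; scikit-learn's DecisionTreeRegressor is my choice of base learner, not necessarily the lecture's, and the toy data are made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, B=50, seed=0):
    """Train B deep trees, each on a bootstrap replicate of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # sampling with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X):
    """f_bag(x) = average of the individual tree predictions."""
    return np.mean([t.predict(X) for t in trees], axis=0)

# Toy usage on a noisy sine curve.
rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=200)
trees = bagged_trees(X, y)
print(predict_bagged(trees, np.array([[1.5], [4.5]])))
```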


Variations of the bagging approach

- Bagging: draw samples with replacement.
- Pasting: alternatively, random subsets of the dataset can be drawn as random subsets of the samples (without replacement).
- Random subspaces: one can also randomly select a subset of the components of the inputs (which are generally multi-dimensional) to learn the models.
- Random patches: the combination of the two previous methods.
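In scikit-learn, these four variants can all be obtained from the same BaggingClassifier estimator by toggling how samples and features are drawn (a sketch; the parameter values are arbitrary):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

base = DecisionTreeClassifier()

# bagging: bootstrap the samples (with replacement), keep all features
bagging = BaggingClassifier(base, n_estimators=50, bootstrap=True)

# pasting: random subsets of the samples, drawn without replacement
pasting = BaggingClassifier(base, n_estimators=50, max_samples=0.5, bootstrap=False)

# random subspaces: keep all samples, draw random subsets of the features
subspaces = BaggingClassifier(base, n_estimators=50, bootstrap=False,
                              max_features=0.5, bootstrap_features=False)

# random patches: random subsets of both samples and features
patches = BaggingClassifier(base, n_estimators=50, max_samples=0.5, bootstrap=False,
                            max_features=0.5, bootstrap_features=False)
```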

Bagging: Random forests


Motivation

There is necessarily some overlap between bootstrapped datasets, so the models learned from these datasets are correlated.

Abstract problem:

- let X_1, ..., X_B be identically distributed but not independent random variables, with mean \mu = E[X_1] and variance \sigma^2 = \mathrm{var}(X_1) = E[(X_1 - \mu)^2];
- consider the empirical mean \mu_B = \frac{1}{B} \sum_{b=1}^{B} X_b;
- assume a positive pairwise correlation \rho: for i ≠ j,

  \rho = \frac{E[(X_i - \mu)(X_j - \mu)]}{\sqrt{\mathrm{var}(X_i)\,\mathrm{var}(X_j)}};

- the expectation does not change, E[\mu_B] = \mu, but the variance is now given by

  \mathrm{var}(\mu_B) = \rho \sigma^2 + \frac{1-\rho}{B} \sigma^2.
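The variance formula follows from a short, standard computation (filled in here since the slide states only the result):

\mathrm{var}(\mu_B) = \frac{1}{B^2} \sum_{i=1}^{B} \sum_{j=1}^{B} \mathrm{cov}(X_i, X_j)
= \frac{1}{B^2} \left( B \sigma^2 + B(B-1)\rho\sigma^2 \right)
= \frac{\sigma^2}{B} + \frac{B-1}{B}\rho\sigma^2
= \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2.

As B → ∞ the second term vanishes, so the correlation puts a floor of \rho\sigma^2 on the variance of the average; lowering this correlation is exactly what random forests aim at.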


Random forests reduce the correlation between the trees by randomizing their construction, without increasing the variance too much.

Algorithm 1: Random Forest
Require: a dataset D = {(x_i, y_i)_{1≤i≤n}}, the size B of the ensemble, the number m of candidate variables for splitting.
for b = 1 to B do
    Draw a bootstrap dataset D_b of size n from the original training set D.
    Grow a random tree using the bootstrapped dataset:
    repeat
        for all terminal nodes do
            Select m variables among the d, at random.
            Pick the best variable and split-point couple among the m.
            Split the node into two daughter nodes.
        end for
    until the stopping criterion is met (e.g., minimum number of samples per node reached)
end for
return the ensemble of B trees.
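For reference, scikit-learn's RandomForestRegressor (or RandomForestClassifier) implements essentially this procedure; m corresponds to the max_features parameter. A usage sketch with arbitrary hyperparameters and synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

forest = RandomForestRegressor(
    n_estimators=100,      # B: size of the ensemble
    max_features="sqrt",   # m: number of candidate variables tried at each split
    min_samples_leaf=5,    # stopping criterion
    bootstrap=True,        # each tree sees a bootstrap replicate of D
    random_state=0,
).fit(X, y)

print(forest.predict(X[:3]))
```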

Bagging: Extremely randomized trees


Extremely randomized trees push the randomization further. The two differences with random forests are:

- split points are also chosen randomly, one for each randomly selected splitting variable;
- the full learning dataset is used for growing each tree (no bootstrap).

The rationale is:

- choosing the split point at random further reduces the correlation between trees;
- using the full learning set achieves a lower bias (at the price of an increased variance, which should be compensated by the randomization of the split points).


Algorithm 2: Extremely Randomized Forest
Require: a dataset D = {(x_i, y_i)_{1≤i≤n}}, the size B of the ensemble, the number m of candidate variables for splitting.
for b = 1 to B do
    Grow a random tree using the original dataset:
    repeat
        for all terminal nodes do
            Select m variables among the d, at random.
            for all sampled variables do
                Select a split point at random.
            end for
            Pick the best variable and split-point couple among the m candidates.
            Split the node into two daughter nodes.
        end for
    until the stopping criterion is met (e.g., minimum number of samples per node reached)
end for
return the ensemble of B trees.
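In scikit-learn this corresponds to ExtraTreesClassifier / ExtraTreesRegressor, whose default bootstrap=False matches the "full dataset" choice above. A usage sketch with arbitrary hyperparameters and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

extra = ExtraTreesClassifier(
    n_estimators=100,      # B
    max_features="sqrt",   # m candidate variables per split, each with a random split point
    bootstrap=False,       # every tree is grown on the full dataset D
    random_state=0,
).fit(X, y)

print(extra.score(X, y))
```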

Boosting


Boosting takes a different route from bagging:

- bagging: learn in parallel a set of models with low bias and high variance (the learning being randomized, for example through bootstrapping); the prediction is made by averaging all these models;
- boosting: add models with high bias and low variance sequentially, so as to reduce the bias of the ensemble.

For boosting, we will focus on binary classification.

Boosting: AdaBoost


Weighted binary classification

We consider binary classification:

- the available dataset is of the form D = {(x_i, y_i)_{1≤i≤n}} with y_i ∈ {−1, +1};
- the empirical risk of interest is

  R_n(f) = \frac{1}{n} \sum_{i=1}^{n} I\{y_i \ne f(x_i)\} = \sum_{i=1}^{n} \frac{1}{n} I\{y_i \ne f(x_i)\};

- all samples have the same importance (each sample has weight 1/n).

Weighted binary classification:

- we want to associate a different weight w_i with each example (x_i, y_i), such that \sum_i w_i = 1 and w_i ≥ 0;
- the corresponding empirical risk is

  R_n(f) = \sum_{i=1}^{n} w_i \, I\{y_i \ne f(x_i)\}.


Weighted binary classification (cont.)

To minimize the weighted empirical risk, one can:

- sample a bootstrap replicate according to the discrete distribution (w_1, ..., w_n) and minimize the (unweighted) empirical risk on this new dataset;
- or do it more directly, in a problem-dependent manner.

The case of classification trees: the node impurity is easily adapted; for example, for the misclassification error, consider

Q(D_m) = \sum_{x_i \in R_m} w_i \, I\{y_i \ne k(m)\}.

In the sequel, we assume that a (weak) base learner is available (typically, a decision stump), and that it is able to minimize the weighted risk using the dataset D and the weights w_i, 1 ≤ i ≤ n.


Algorithm 3: The AdaBoost (Adaptive Boosting) algorithm
Require: a dataset D = {(x_i, y_i)_{1≤i≤n}}, the size T of the ensemble.
1: Initialize the weights w_i^1 = 1/n, 1 ≤ i ≤ n.
2: for t = 1 to T do
3:     Fit a binary classifier f_t(x) to the training data using the weights w_i^t.
4:     Compute the error \epsilon_t made by this classifier:

       \epsilon_t = \sum_{i=1}^{n} w_i^t \, I\{y_i \ne f_t(x_i)\}.

5:     Compute the learning rate \alpha_t:

       \alpha_t = \frac{1}{2} \ln\left( \frac{1 - \epsilon_t}{\epsilon_t} \right).

6:     Update the weights, for all 1 ≤ i ≤ n:

       w_i^{t+1} = \frac{w_i^t e^{-\alpha_t y_i f_t(x_i)}}{\sum_{j=1}^{n} w_j^t e^{-\alpha_t y_j f_t(x_j)}}.

7: end for
8: return the decision rule

   H_T(x) = \mathrm{sgn}(F_T(x)) with F_T(x) = \sum_{t=1}^{T} \alpha_t f_t(x).
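A compact from-scratch sketch of this algorithm, using scikit-learn depth-1 trees as weak learners via their sample_weight argument (my own illustrative implementation, with a numerical clip added as a guard; scikit-learn also ships a ready-made AdaBoostClassifier):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """AdaBoost for labels y in {-1, +1}, with decision stumps as weak learners."""
    n = len(y)
    w = np.full(n, 1.0 / n)                                     # step 1: uniform weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.clip(np.sum(w * (pred != y)), 1e-12, 1 - 1e-12)  # step 4: weighted error
        alpha = 0.5 * np.log((1.0 - eps) / eps)                 # step 5: learning rate
        w = w * np.exp(-alpha * y * pred)                       # step 6: reweight...
        w /= w.sum()                                            # ...and renormalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    """H_T(x) = sgn( sum_t alpha_t f_t(x) )."""
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))

# Toy usage.
X, y01 = make_classification(n_samples=300, random_state=0)
y = 2 * y01 - 1                                  # map {0, 1} labels to {-1, +1}
stumps, alphas = adaboost(X, y, T=30)
print((predict(stumps, alphas, X) == y).mean())  # training accuracy
```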

Boosting: Derivation and partial analysis


Forward stagewise additive modeling

A convex surrogate for binary classification is the risk based on the exponential loss:

R_n(F) = \frac{1}{n} \sum_{i=1}^{n} e^{-y_i F(x_i)} with F \in \mathbb{R}^{\mathcal{X}}.

We are looking for an additive model of the form

F_T(x) = \sum_{t=1}^{T} \alpha_t f_t(x) with f_t \in \{-1, +1\}^{\mathcal{X}} and \alpha_t \in \mathbb{R}.

The corresponding optimization problem is therefore

\min_{(\alpha_t, f_t)_{1 \le t \le T}} \frac{1}{n} \sum_{i=1}^{n} e^{-y_i \sum_{t=1}^{T} \alpha_t f_t(x_i)}.

Yet, this optimization problem is too complicated to solve jointly.


Forward stagewise additive modeling (cont.)

A simple alternative is to search for an approximate solution by sequentially adding basis functions and their associated weights:

- define F_0 = 0;
- define F_t = F_{t-1} + \alpha_t f_t;
- solve sequentially the subproblems

  \min_{\alpha, f} \frac{1}{n} \sum_{i=1}^{n} e^{-y_i (F_{t-1}(x_i) + \alpha f(x_i))}.

This approach:

- is reminiscent of gradient descent (with line search);
- can be straightforwardly generalized to any loss function L:

  \min_{\alpha, f} \frac{1}{n} \sum_{i=1}^{n} L(y_i, F_{t-1}(x_i) + \alpha f(x_i)).


Forward stagewise additive modeling (cont.)

At each iteration t ≥ 1, we have to solve

(\alpha_t, f_t) = \arg\min_{\alpha, f} \frac{1}{n} \sum_{i=1}^{n} e^{-y_i (F_{t-1}(x_i) + \alpha f(x_i))}.

The solution is given by

f_t = \arg\min_{f \in \mathcal{H}} \sum_{i=1}^{n} w_i^t \, I\{y_i \ne f(x_i)\} with w_i^t = \frac{e^{-y_i F_{t-1}(x_i)}}{\sum_{j=1}^{n} e^{-y_j F_{t-1}(x_j)}}

and

\alpha_t = \frac{1}{2} \ln\left( \frac{1 - \epsilon_t}{\epsilon_t} \right) with \epsilon_t = \sum_{i=1}^{n} w_i^t \, I\{y_i \ne f_t(x_i)\}.

This is the AdaBoost algorithm!


Bounding the empirical risk

Define the edge \gamma_t = \frac{1}{2} - \epsilon_t, which measures how much better the error rate of the t-th learned classifier f_t is than random guessing (error rate of 1/2).

How good is the ensemble H_T(x) = \mathrm{sgn}(F_T(x)) compared to a single weak learner?

Theorem. Write \gamma_t = \frac{1}{2} - \epsilon_t for the edge of the t-th classifier. The empirical risk of the combined classifier H_T produced by AdaBoost satisfies

\frac{1}{n} \sum_{i=1}^{n} I\{y_i \ne H_T(x_i)\} \le \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} \le e^{-2 \sum_{t=1}^{T} \gamma_t^2}.
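A one-line consequence, filled in here because it is the usual way this bound is read (weak learnability): if every weak learner has edge at least \gamma > 0, then

\frac{1}{n} \sum_{i=1}^{n} I\{y_i \ne H_T(x_i)\} \le e^{-2 T \gamma^2},

so the training error of the ensemble decreases exponentially with the number T of boosting rounds, even though each f_t is only slightly better than random guessing.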

Boosting: Restricted functional gradient descent


Rationale

Consider the binary classification problem with a convex surrogate:

- we are looking for a classifier H(x) = \mathrm{sgn}(F(x)) with F \in \mathbb{R}^{\mathcal{X}};
- let L(y, F(x)) be a convex surrogate of the binary loss (for example, the exponential loss L(y, F(x)) = e^{-y F(x)});
- we would like to minimize the empirical risk:

  \min_{F \in \mathbb{R}^{\mathcal{X}}} R_n(F) with R_n(F) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, F(x_i)).

Convex optimization:

- as a sum of convex functions, R_n is convex in F;
- a standard approach for minimizing a convex function is to perform a gradient descent:

  F_{t+1} = F_t - \alpha_t \nabla_F R_n(F_t).

Problem: F is a function, so what is \nabla_F? Is it usable?


Functional gradient and related Hilbert space

Assume that \mathcal{X} is measurable and let \mu be a probability measure.

Define L^2(\mathcal{X}, \mathbb{R}, \mu) as the set of all equivalence classes of functions F \in \mathbb{R}^{\mathcal{X}} such that the Lebesgue integral \int_{\mathcal{X}} F(x)^2 \, d\mu(x) is finite.

This Hilbert space has a natural inner product: \langle F, G \rangle_\mu = \int_{\mathcal{X}} F(x) G(x) \, d\mu(x).

For a functional J : L^2(\mathcal{X}, \mathbb{R}, \mu) \to \mathbb{R}, the (Fréchet) gradient is the element \nabla J(F) satisfying

\lim_{G \to 0} \frac{J(F + G) - J(F) - \langle \nabla J(F), G \rangle_\mu}{\|G\|_\mu} = 0.


Back to risk minimization

The probability measure of interest here is the discrete measure that assigns probability 1/n to each input x_i. The associated inner product is

\langle F, G \rangle_n = \frac{1}{n} \sum_{i=1}^{n} F(x_i) G(x_i).

We can compute the Fréchet derivative, so we can write the gradient descent

F_{t+1} = F_t - \alpha_t \nabla_F R_n(F_t).

However, this functional gradient is known only at the datapoints x_i: it does not allow generalizing and it is not a practical object to compute with. The idea is therefore to "restrict" this gradient to the hypothesis space \mathcal{H} of interest, by looking for the function of \mathcal{H} that is the most collinear with the gradient (with a comparable norm),

f_t \in \arg\max_{f \in \mathcal{H}} \frac{\langle \nabla_F R_n(F_t), f \rangle_n}{\|f\|_n},

and then replacing the gradient by this approximation in the update rule:

F_t = F_{t-1} - \alpha_t f_t.
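A sketch of this generic loop (my own illustration: the restriction step is implemented here by least-squares fitting the pointwise negative gradient with a shallow regression tree, the usual practical stand-in for the argmax-collinearity step above, not literally the slides' procedure):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

def functional_gradient_boosting(X, y, loss_grad, T=100, lr=0.1):
    """Generic restricted functional gradient descent on the empirical risk."""
    F = np.zeros(len(y))                 # F_0 = 0, represented by its values at the x_i
    learners = []
    for _ in range(T):
        g = loss_grad(y, F)              # pointwise gradient dL(y_i, F(x_i)) / dF(x_i)
        h = DecisionTreeRegressor(max_depth=2).fit(X, -g)   # restrict -gradient to H
        F = F + lr * h.predict(X)        # F_t = F_{t-1} + lr * (restricted negative gradient)
        learners.append(h)
    return learners

def predict_F(learners, X, lr=0.1):
    """F_T(x) = lr * sum_t h_t(x) (constant step size instead of a line search)."""
    return lr * sum(h.predict(X) for h in learners)

# Exponential loss: d/dF e^{-yF} = -y e^{-yF}, for y in {-1, +1}.
exp_grad = lambda y, F: -y * np.exp(-y * F)

X, y01 = make_classification(n_samples=300, random_state=0)
y = 2 * y01 - 1
learners = functional_gradient_boosting(X, y, exp_grad, T=50)
print((np.sign(predict_F(learners, X)) == y).mean())   # training accuracy
```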


Application to the exponential loss

Consider the risk based on the exponential loss:

R_n(F) = \frac{1}{n} \sum_{i=1}^{n} e^{-y_i F(x_i)}.

Restricting the gradient gives

\arg\max_{f \in \mathcal{H}} \frac{\langle \nabla R_n(F_t), f \rangle_n}{\|f\|_n} = \arg\max_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} e^{-y_i F_t(x_i)} I\{y_i \ne f(x_i)\} = -f_t

(i.e., minus the minimizer f_t of the weighted risk). The optimal learning rate can be obtained by performing a line search:

\alpha_t = \arg\min_{\alpha > 0} R_n(F_t + \alpha f_t) = \frac{1}{2} \ln\left( \frac{1 - \epsilon_t}{\epsilon_t} \right) with \epsilon_t = \frac{\sum_{i=1}^{n} e^{-y_i F_t(x_i)} I\{y_i \ne f_t(x_i)\}}{\sum_{i=1}^{n} e^{-y_i F_t(x_i)}}.

This is AdaBoost!