Ensemble Methods (Machine Learning, SIR)
Transcript of the lecture slides.

Matthieu Geist (CentraleSupelec), [email protected]
Outline
1. Introduction: Decision trees; Overview
2. Bagging: Bootstrap aggregating; Random forests; Extremely randomized trees
3. Boosting: AdaBoost; Derivation and partial analysis; Restricted functional gradient descent
- (Base) learner: predictions are made by an estimator built with a given learning algorithm.
- Ensemble methods: improve generalization and robustness by combining the predictions of several base learners.
Figure: An illustration of bagging.
Figure: An illustration of boosting.
Introduction

Decision trees
Basic idea
A regression tree partitions the input space into regions $R_1, \dots, R_5$ and predicts a constant $c_m$ on each of them:

$f(x) = \sum_{m=1}^{5} c_m \mathbb{I}\{x \in R_m\}.$

Figure: Illustration of a regression tree.
Building a regression tree
Assume the partition $(R_1, \dots, R_M)$ is fixed; the regression model associates a constant to each leaf:

$f(x) = \sum_{m=1}^{M} c_m \mathbb{I}\{x \in R_m\}.$
Minimize the empirical risk based on the $\ell_2$-loss:

$\min_{c_1, \dots, c_M} R_n(f) = \min_{c_1, \dots, c_M} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2.$
The solution is obtained by setting the gradient to zero:

$c_m = \operatorname{ave}(y_i \mid x_i \in R_m) = \frac{\sum_{i=1}^{n} y_i \mathbb{I}\{x_i \in R_m\}}{\sum_{i=1}^{n} \mathbb{I}\{x_i \in R_m\}}.$
Building a regression tree (cont.)
How to find the best binary partition in terms of $R_n$?
- In general, this is computationally infeasible.
- Idea: use a (heuristic) greedy approach.
Let $D$ be the dataset:

$D = \{(x_i, y_i)\}_{1 \le i \le n}$ with $x_i = (x_{i,1}, \dots, x_{i,d})^\top \in \mathbb{R}^d.$
For a splitting variable $j$ and a split point $s$, define the pair of half-planes

$R_1(j, s) = \{x_i \in D : x_{i,j} \le s\}$ and $R_2(j, s) = \{x_i \in D : x_{i,j} > s\}.$
Then, search for $(j, s)$ solving

$\min_{j, s} \left( \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right).$
Building a regression tree (cont.)
There is a finite number of $(j, s)$ couples:
- $d$ splitting variables: $1 \le j \le d$;
- $n - 1$ split points for each splitting variable $j$: obtained by ordering the $j$th components.
The inner minimization problem is easily solved for any $(j, s)$:

$c_1 = \operatorname{ave}(y_i \mid x_i \in R_1(j, s))$ and $c_2 = \operatorname{ave}(y_i \mid x_i \in R_2(j, s)).$
Therefore, by scanning through each dimension of all the inputs, the best pair $(j, s)$ can be determined at a reasonable cost.
Then, having found the best split, we partition the data into the two resulting regions and repeat the splitting procedure on each of them.
And so on, until a stopping criterion is reached. A sketch of the split search is given below.
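To make the greedy search concrete, here is a minimal sketch in Python (NumPy only; the function name and its interface are illustrative, not from the lecture). It scans every dimension and every candidate split point, and returns the pair $(j, s)$ minimizing the $\ell_2$ criterion above.

```python
import numpy as np

def best_split(X, y):
    """Exhaustive greedy search for the (j, s) couple minimizing
    the sum of squared errors over the two half-planes."""
    n, d = X.shape
    best_j, best_s, best_score = None, None, np.inf
    for j in range(d):
        # Candidate split points: midpoints between sorted j-th components.
        values = np.sort(np.unique(X[:, j]))
        for s in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            # Inner minimization: the optimal constant in a region is the mean.
            score = ((left - left.mean()) ** 2).sum() \
                  + ((right - right.mean()) ** 2).sum()
            if score < best_score:
                best_j, best_s, best_score = j, s, score
    return best_j, best_s, best_score
```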
Building a regression tree (cont.)
Possible stopping criteria:
- maximum depth;
- maximum number of leaves;
- maximum number of samples per leaf;
- etc.
The stopping criterion is important:
- a very large tree will overfit the data (consider a tree with as many leaves as samples);
- a too small tree might not capture the important structure (e.g., a decision stump, a tree with only two terminal nodes).
Building a classification tree
Classification tree: $y \in \{1, \dots, K\}$; the $\ell_2$-loss is not the best choice, so we should consider another criterion for splitting.
For a node $m$ representing a region $R_m$, consider:
- the number of data points $n_m = \sum_{i=1}^{n} \mathbb{I}\{x_i \in R_m\}$;
- the associated dataset $D_m = \{(x_i, y_i) \in D : x_i \in R_m\}$;
- the proportion of class-$k$ observations in node $m$, $p_{m,k} = \frac{1}{n_m} \sum_{x_i \in R_m} \mathbb{I}\{y_i = k\}$;
- the majority class $k(m) = \operatorname{argmax}_{1 \le k \le K} p_{m,k}$.
In regression, partitioning was based on the node impurity

$Q(D_m) = \frac{1}{n_m} \sum_{x_i \in R_m} (y_i - c_m)^2;$

what node impurity should one choose for classification?
Building a classification tree (cont.)
- Misclassification error:
  $Q(D_m) = \frac{1}{n_m} \sum_{x_i \in R_m} \mathbb{I}\{y_i \ne k(m)\} = 1 - p_{m,k(m)};$
- Gini index:
  $Q(D_m) = \sum_{k \ne k'} p_{m,k} \, p_{m,k'} = \sum_{k=1}^{K} p_{m,k} (1 - p_{m,k});$
- Cross-entropy:
  $Q(D_m) = -\sum_{k=1}^{K} p_{m,k} \ln p_{m,k}.$
Figure: Measures of node impurity (misclassification error, Gini index, cross-entropy) for binary classification, as a function of the proportion p in the second class.
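As a quick sanity check of these formulas, here is a minimal sketch (function names are illustrative) computing the three impurities from the vector of class proportions of a node; in the binary case, evaluating them on $(1-p, p)$ for $p \in [0, 1]$ reproduces the curves of the figure above.

```python
import numpy as np

def misclassification(p):
    """p: vector of class proportions (p_{m,1}, ..., p_{m,K}) of a node."""
    return 1.0 - p.max()

def gini(p):
    return float((p * (1.0 - p)).sum())

def cross_entropy(p):
    q = p[p > 0]  # convention: 0 * ln(0) = 0
    return float(-(q * np.log(q)).sum())

# Binary node with proportion 0.3 in the second class:
p = np.array([0.7, 0.3])
print(misclassification(p), gini(p), cross_entropy(p))
```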
Building a classification tree (cont.)
The tree is grown as previously. For a region $R_m$:
- consider a couple $(j, s)$ of splitting variable and split point;
- write $D_{m,L}(j, s)$ for the resulting dataset of the left node (of size $n_{m_L}$);
- write $D_{m,R}(j, s)$ for the dataset of the right node (of size $n_{m_R}$).
The tree is grown by solving

$\min_{j, s} \left( n_{m_L} Q(D_{m,L}(j, s)) + n_{m_R} Q(D_{m,R}(j, s)) \right).$
Summary
Trees are interpretable predictors, quite easy to train.
Other advantages:
- they can handle categorical values;
- they can be extended to cost-sensitive learning;
- they can handle missing input values.
However, trees are unstable:
- they have a high variance: a small change in the data can result in a very different tree;
- this is mainly caused by the hierarchical nature of the process (the effect of an error is propagated down);
- the smaller the tree, the more stable it is, but the weaker the learner.
Overview

Bagging (or more generally averaging methods):
- build several estimators in parallel, more or less independently, and average their predictions;
- this reduces the variance of the combined estimator;
- applicable for example to deep trees.

Boosting:
- build (weak) base estimators sequentially, so as to reduce the bias of the combined estimator;
- motivation: combine weak learners to form a strong learner;
- applicable for example to decision stumps.

There exist other ensemble methods (Bayesian model averaging, mixture of experts or stacking, for example), not addressed here.
Bagging
Motivation
Abstract problem:
- let $X_1, \dots, X_B$ be i.i.d. random variables, of mean $\mu = E[X_1]$ and of variance $\sigma^2 = \operatorname{var}(X_1) = E[(X_1 - \mu)^2]$;
- consider the empirical mean $\mu_B = \frac{1}{B} \sum_{b=1}^{B} X_b$;
- the expectation does not change, $E[\mu_B] = \mu$, while the variance is reduced, $\operatorname{var}(\mu_B) = \frac{\sigma^2}{B}$.
Supervised learning:
- the random quantity is the dataset $D = \{(x_i, y_i)\}_{1 \le i \le n}$;
- from this, an estimate $f_D$ is computed by minimizing the empirical risk of interest.

Ideal case:
- let $D_1, \dots, D_B$ be datasets drawn independently, and write $f_b = f_{D_b}$ the associated minimizer;
- define $f_{\mathrm{ave}} = \frac{1}{B} \sum_{b=1}^{B} f_b$;
- this does not change the expectation, $E[f_{\mathrm{ave}}] = E[f_1]$, but it reduces the variance, $\operatorname{var}(f_{\mathrm{ave}}) = \frac{1}{B} \operatorname{var}(f_1)$.

Unfortunately, it is not possible to sample datasets on demand...
Bootstrap aggregating
bagging = bootstrap + aggregating
bootstrap = random sampling with replacement (statistics)
Figure: Illustration of the bootstrapping principle.
Bagging:
- bootstrap the dataset to get $B$ datasets $D_b$ (same size as $D$), $1 \le b \le B$;
- learn a predictor $f_{D_b} = f_b$ for each of these datasets;
- average the predictors:

$f_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x).$
This applies directly to regression. For classification, averaging the predicted classes would not make sense. Possible strategies:
- do a majority vote over trees:

$f_{\mathrm{bag}}(x) = \operatorname{argmax}_{1 \le k \le K} \left( \frac{1}{B} \sum_{b=1}^{B} \mathbb{I}\{f_b(x) = k\} \right);$

- consider the class proportions of the leaf corresponding to the input of interest ($p_{m,k}$, with $x \in R_m$) for each tree, average them over all trees, and output the class that maximizes this averaged class proportion.
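A minimal from-scratch sketch of bagging for regression (a sketch under the assumption that scikit-learn trees are used as base learners; function names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def bagging_fit(X, y, B=100):
    """Fit B deep trees, each on a bootstrap replicate of (X, y)."""
    n = len(X)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # sampling with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Average the predictions of the ensemble (regression case)."""
    return np.mean([m.predict(X) for m in models], axis=0)
```

For classification, it would suffice to replace the averaging by one of the two strategies above (majority vote or averaged class proportions).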
Variations of the bagging approach
- Bagging: draw random subsets of the samples, with replacement.
- Pasting: alternatively, random subsets of the samples can be drawn without replacement.
- Random subspaces: one can also randomly select a subset of the components of the inputs (which are generally multi-dimensional) to learn the models.
- Random patches: combination of the two previous methods (random subsets of both samples and input components).
Random forests
Motivation
There is necessarily some overlap between bootstrapped datasets, so the models corresponding to each of these datasets are correlated.

Abstract problem:
- let $X_1, \dots, X_B$ be identically distributed but not independent random variables, of mean $\mu = E[X_1]$ and of variance $\sigma^2 = \operatorname{var}(X_1) = E[(X_1 - \mu)^2]$;
- consider the empirical mean $\mu_B = \frac{1}{B} \sum_{b=1}^{B} X_b$;
- assume a positive pairwise correlation $\rho$: for $i \ne j$,

$\rho = \frac{E[(X_i - \mu)(X_j - \mu)]}{\sqrt{\operatorname{var}(X_i) \operatorname{var}(X_j)}};$

- the expectation does not change, $E[\mu_B] = \mu$, but the variance is now given by

$\operatorname{var}(\mu_B) = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2.$

As $B$ grows, the second term vanishes but the first one remains: the benefit of averaging is limited by the correlation $\rho$.
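A quick numerical check of this formula, simulating equicorrelated Gaussian variables (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
B, rho, sigma2, trials = 10, 0.3, 1.0, 200_000

# Equicorrelated covariance: sigma2 on the diagonal, rho * sigma2 elsewhere.
cov = sigma2 * ((1 - rho) * np.eye(B) + rho * np.ones((B, B)))
X = rng.multivariate_normal(np.zeros(B), cov, size=trials)

print(X.mean(axis=1).var())                   # empirical variance of mu_B
print(rho * sigma2 + (1 - rho) / B * sigma2)  # theoretical value: 0.37
```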
Random forests: reduce the correlation between the trees by randomizing their construction, without increasing the variance too much.
Algorithm 1: Random Forest
Require: a dataset $D = \{(x_i, y_i)\}_{1 \le i \le n}$, the size $B$ of the ensemble, the number $m$ of candidate variables for splitting.
for b = 1 to B do
    Draw a bootstrap dataset $D_b$ of size $n$ from the original training set $D$.
    Grow a random tree using the bootstrapped dataset:
    repeat
        for all terminal nodes do
            Select $m$ variables among $d$, at random.
            Pick the best variable and split-point couple among the $m$.
            Split the node into two daughter nodes.
        end for
    until the stopping criterion is met (e.g., minimum number of samples per node reached)
end for
return the ensemble of B trees.
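In practice this corresponds, for example, to scikit-learn's random forests, where the number $m$ of candidate variables is exposed as max_features (a hedged usage sketch; the hyperparameter values are illustrative):

```python
from sklearn.ensemble import RandomForestRegressor

# B = n_estimators bootstrapped trees; m = max_features candidates per split.
forest = RandomForestRegressor(n_estimators=100, max_features="sqrt")
# forest.fit(X_train, y_train); forest.predict(X_test)
```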
Extremely randomized trees
Extremely randomized trees push the randomization further. The two differences with random forests are:
- split points are also chosen randomly, one for each randomly selected splitting variable;
- the full learning dataset is used for growing each tree (no bootstrap).

The rationale for:
- choosing also the split point at random is to further reduce the correlation between trees;
- using the full learning set is to achieve a lower bias (at the price of an increased variance, which should be compensated by the randomization of split points).
Algorithm 2: Extremely Randomized Forest
Require: a dataset $D = \{(x_i, y_i)\}_{1 \le i \le n}$, the size $B$ of the ensemble, the number $m$ of candidate variables for splitting.
for b = 1 to B do
    Grow a random tree using the original dataset:
    repeat
        for all terminal nodes do
            Select $m$ variables among $d$, at random.
            for all sampled variables do
                Select a split point at random.
            end for
            Pick the best variable and split-point couple among the $m$ candidates.
            Split the node into two daughter nodes.
        end for
    until the stopping criterion is met (e.g., minimum number of samples per node reached)
end for
return the ensemble of B trees.
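Again as a hedged usage sketch, extremely randomized trees are available in scikit-learn as ExtraTrees estimators (which, consistently with the algorithm above, do not bootstrap by default):

```python
from sklearn.ensemble import ExtraTreesRegressor

# Random split points; the full dataset is used for each tree (bootstrap=False).
extra = ExtraTreesRegressor(n_estimators=100, max_features="sqrt")
# extra.fit(X_train, y_train); extra.predict(X_test)
```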
Boosting
Boosting takes a different route from bagging:
- bagging: learn in parallel a set of models with low bias and high variance (learning being randomized, for example through bootstrapping), the prediction being made by averaging all these models;
- boosting: add models with high bias and low variance sequentially, so as to reduce the bias of the ensemble.

For boosting, we will focus on binary classification.
AdaBoost
Weighted binary classification
We consider binary classification:
- the available dataset is of the form $D = \{(x_i, y_i)\}_{1 \le i \le n}$ with $y_i \in \{-1, +1\}$;
- the empirical risk of interest is

$R_n(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{y_i \ne f(x_i)\} = \sum_{i=1}^{n} \frac{1}{n} \mathbb{I}\{y_i \ne f(x_i)\};$

- all samples have the same importance (each sample has a weight $\frac{1}{n}$).
Weighted binary classification:
- we want to associate a different weight $w_i$ to each example $(x_i, y_i)$, such that $\sum_i w_i = 1$ and $w_i \ge 0$;
- the corresponding empirical risk is

$R_n(f) = \sum_{i=1}^{n} w_i \mathbb{I}\{y_i \ne f(x_i)\}.$
Weighted binary classification (cont.)
For minimizing the weighted empirical risk, one can:
- sample a bootstrap replicate according to the discrete distribution $(w_1, \dots, w_n)$ and minimize the (unweighted) empirical risk on this new dataset;
- or do it more directly, in a problem-dependent manner.

The case of classification trees: the node impurity can easily be adapted; for example, for the misclassification error, consider

$Q(D_m) = \sum_{x_i \in R_m} w_i \mathbb{I}\{y_i \ne k(m)\}.$

In the sequel, we assume that a (weak) base learner is available (typically, a decision stump), and that it is able to minimize the weighted risk using the dataset $D$ and weights $w_i$, $1 \le i \le n$.
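Many implementations expose such instance weights directly; for example, with scikit-learn trees (a minimal sketch, where X, y and w are assumed given):

```python
from sklearn.tree import DecisionTreeClassifier

# A decision stump (depth-1 tree) minimizing a weighted impurity:
stump = DecisionTreeClassifier(max_depth=1)
# stump.fit(X, y, sample_weight=w)  # w: the weights (w_1, ..., w_n)
```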
Algorithm 3: The AdaBoost (Adaptive Boosting) algorithm
Require: a dataset $D = \{(x_i, y_i)\}_{1 \le i \le n}$, the size $T$ of the ensemble.
1: Initialize the weights: $w_i^1 = \frac{1}{n}$, $1 \le i \le n$.
2: for t = 1 to T do
3:     Fit a binary classifier $f_t(x)$ to the training data using the weights $w_i^t$.
4:     Compute the error $\epsilon_t$ made by this classifier: $\epsilon_t = \sum_{i=1}^{n} w_i^t \mathbb{I}\{y_i \ne f_t(x_i)\}.$
5:     Compute the learning rate $\alpha_t$: $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right).$
6:     Update the weights, for all $1 \le i \le n$: $w_i^{t+1} = \frac{w_i^t e^{-\alpha_t y_i f_t(x_i)}}{\sum_{j=1}^{n} w_j^t e^{-\alpha_t y_j f_t(x_j)}}.$
7: end for
8: return the decision rule $H_T(x) = \operatorname{sgn}(F_T(x))$ with $F_T(x) = \sum_{t=1}^{T} \alpha_t f_t(x).$
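As a companion to Algorithm 3, here is a compact from-scratch sketch (assuming scikit-learn decision stumps as weak learners, labels in {-1, +1}, and 0 < eps_t < 1 so that alpha_t is well defined):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """AdaBoost with decision stumps; y must take values in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)  # step 1: uniform initial weights
    stumps, alphas = [], []
    for _ in range(T):
        f = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = f.predict(X)
        eps = w[pred != y].sum()               # step 4: weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)  # step 5: learning rate
        w = w * np.exp(-alpha * y * pred)      # step 6: reweight...
        w = w / w.sum()                        # ...and normalize
        stumps.append(f)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Step 8: sign of the weighted combination of the weak learners."""
    F = sum(alpha * f.predict(X) for f, alpha in zip(stumps, alphas))
    return np.sign(F)
```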
Derivation and partial analysis
Forward stagewise additive modeling
A convex surrogate for binary classification is the risk based on the exponential loss:

$R_n(F) = \frac{1}{n} \sum_{i=1}^{n} e^{-y_i F(x_i)}$ with $F \in \mathbb{R}^{\mathcal{X}}.$
We are looking for an additive model of the form

$F_T(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$ with $f_t \in \{-1, +1\}^{\mathcal{X}}$ and $\alpha_t \in \mathbb{R}.$
The corresponding optimization problem is therefore

$\min_{(\alpha_t, f_t)_{1 \le t \le T}} \frac{1}{n} \sum_{i=1}^{n} e^{-y_i \sum_{t=1}^{T} \alpha_t f_t(x_i)}.$

Yet, this optimization problem is too complicated to be solved jointly.
Forward stagewise additive modeling (cont.)
A simple alternative is to search for an approximate solution by sequentially adding basis functions and associated weights:
- define $F_0 = 0$;
- define $F_t = F_{t-1} + \alpha_t f_t$;
- solve sequentially the subproblems

$\min_{\alpha, f} \frac{1}{n} \sum_{i=1}^{n} e^{-y_i (F_{t-1}(x_i) + \alpha f(x_i))}.$
This approach:
- is reminiscent of gradient descent (with line search);
- can be straightforwardly extended to any loss function $L$:

$\min_{\alpha, f} \frac{1}{n} \sum_{i=1}^{n} L(y_i, F_{t-1}(x_i) + \alpha f(x_i)).$
Forward stagewise additive modeling (cont.)
At each iteration $t \ge 1$, we have to solve

$(\alpha_t, f_t) = \operatorname{argmin}_{\alpha, f} \frac{1}{n} \sum_{i=1}^{n} e^{-y_i (F_{t-1}(x_i) + \alpha f(x_i))}.$
The solution is given by

$f_t = \operatorname{argmin}_{f \in \mathcal{H}} \sum_{i=1}^{n} w_i^t \mathbb{I}\{y_i \ne f(x_i)\}$ with $w_i^t = \frac{e^{-y_i F_{t-1}(x_i)}}{\sum_{j=1}^{n} e^{-y_j F_{t-1}(x_j)}},$

and $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$ with $\epsilon_t = \sum_{i=1}^{n} w_i^t \mathbb{I}\{y_i \ne f_t(x_i)\}.$

This is the AdaBoost algorithm!
Bounding the empirical risk
Define the edge $\gamma_t = \frac{1}{2} - \epsilon_t$, which measures how much better the error rate of the $t$th learned classifier $f_t$ is than random guessing (error rate of $\frac{1}{2}$).

How good is the ensemble $H_T(x) = \operatorname{sgn}(F_T(x))$ compared to a single weak learner?
Theorem. Write $\gamma_t = \frac{1}{2} - \epsilon_t$ the edge of the $t$th classifier. The empirical risk of the combined classifier $H_T$ produced by AdaBoost satisfies

$\frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{y_i \ne H_T(x_i)\} \le \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} \le e^{-2 \sum_{t=1}^{T} \gamma_t^2}.$

In particular, if every weak learner has an edge of at least $\gamma > 0$, the empirical risk decays exponentially fast, as $e^{-2T\gamma^2}$.
Restricted functional gradient descent
Rationale
Consider the binary classification problem with a convex surrogate:
- we are looking for a classifier $H(x) = \operatorname{sgn}(F(x))$ with $F \in \mathbb{R}^{\mathcal{X}}$;
- let $L(y, F(x))$ be a convex surrogate of the binary loss (for example, the exponential loss $L(y, F(x)) = e^{-y F(x)}$);
- we would like to minimize the empirical risk:

$\min_{F \in \mathbb{R}^{\mathcal{X}}} R_n(F)$ with $R_n(F) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, F(x_i)).$
Convex optimization:
- as a sum of convex functions, $R_n$ is convex in $F$;
- a standard approach for minimizing a convex function is to perform a gradient descent:

$F_{t+1} = F_t - \alpha_t \nabla_F R_n(F_t).$

Problem: $F$ is a function, so what is $\nabla_F$? Is it usable?
Functional gradient and related Hilbert space
Assume that $\mathcal{X}$ is measurable and let $\mu$ be a probability measure.
Define $L_2(\mathcal{X}, \mathbb{R}, \mu)$ as the set of all equivalence classes of functions $F \in \mathbb{R}^{\mathcal{X}}$ such that the Lebesgue integral $\int_{\mathcal{X}} F(x)^2 \, d\mu(x)$ is finite.
This Hilbert space has a natural inner product: $\langle F, G \rangle_\mu = \int_{\mathcal{X}} F(x) G(x) \, d\mu(x)$.
For a functional $J : L_2(\mathcal{X}, \mathbb{R}, \mu) \to \mathbb{R}$, the Fréchet derivative is the linear operator $\nabla J(F)$ satisfying

$\lim_{G \to 0} \frac{J(F + G) - J(F) - \langle \nabla J(F), G \rangle_\mu}{\|G\|_\mu} = 0.$
Back to risk minimization
The probability measure we are interested in here is the discrete measure that assigns a probability $\frac{1}{n}$ to each input $x_i$. The associated inner product is

$\langle F, G \rangle_n = \frac{1}{n} \sum_{i=1}^{n} F(x_i) G(x_i).$

We can compute the Fréchet derivative, so we can write the gradient descent:

$F_{t+1} = F_t - \alpha_t \nabla_F R_n(F_t).$

However, the Fréchet derivative is only known at the datapoints $x_i$. It does not allow generalizing, and it is not a practical object to compute with. The idea is therefore to "restrict" this gradient to the hypothesis space $\mathcal{H}$ of interest, by looking for the function of $\mathcal{H}$ that is the most collinear to the gradient (with a comparable norm):

$f_t \in \operatorname{argmax}_{f \in \mathcal{H}} \frac{\langle \nabla_F R_n(F_t), f \rangle_n}{\|f\|_n}.$

Then, we replace the gradient by its approximation in the update rule:

$F_t = F_{t-1} - \alpha_t f_t.$
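A hedged sketch of this restricted functional gradient scheme, instantiated for regression with the squared loss rather than the exponential loss (this is the gradient-boosting flavor of the idea; names and hyperparameters are illustrative). The functional gradient of $\frac{1}{2}(y - F)^2$ at the datapoints is $F_t(x_i) - y_i$, so each base learner is fit to the residuals, and a fixed step size replaces the line search:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def functional_gradient_boosting(X, y, T=100, lr=0.1):
    """Each iteration fits a base learner to the negative functional
    gradient evaluated at the datapoints (here, the residuals y - F)."""
    F = np.zeros(len(y))  # F_0 = 0
    models = []
    for _ in range(T):
        residuals = y - F                                  # -gradient
        h = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        F = F + lr * h.predict(X)                          # update F_t
        models.append(h)
    return models
```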
Application to the exponential loss
Consider the risk based on the exponential loss:

$R_n(F) = \frac{1}{n} \sum_{i=1}^{n} e^{-y_i F(x_i)}.$
Restricting the gradient gives

$\operatorname{argmax}_{f \in \mathcal{H}} \frac{\langle \nabla R_n(F_t), f \rangle_n}{\|f\|_n} = \operatorname{argmax}_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} e^{-y_i F_t(x_i)} \mathbb{I}\{y_i \ne f(x_i)\} = -f_t.$
The optimal learning rate can be obtained by performing a line search:

$\alpha_t = \operatorname{argmin}_{\alpha > 0} R_n(F_t + \alpha f_t) = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$ with $\epsilon_t = \frac{\sum_{i=1}^{n} e^{-y_i F_t(x_i)} \mathbb{I}\{y_i \ne f_t(x_i)\}}{\sum_{i=1}^{n} e^{-y_i F_t(x_i)}}.$
This is AdaBoost!