Page 1

Random Forest

Applied Multivariate Statistics – Spring 2013


Page 2

Overview

Intuition of Random Forest

The Random Forest Algorithm

De-correlation gives better accuracy

Out-of-bag error (OOB-error)

Variable importance


Page 3

Intuition of Random Forest


[Figure: three decision trees (Tree 1, Tree 2, Tree 3) fitted to the same data. Each tree splits on different variables (age: young/old; height: short/tall; sex: female/male; work status: working/retired), and each leaf is labeled healthy or diseased.]

New sample: old, retired, male, short

Tree predictions: diseased, healthy, diseased

Majority rule: diseased
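The majority rule on the three tree predictions above fits in a couple of lines of base R (a minimal sketch; the prediction vector is taken straight from the example):

## Predictions of the three trees for the new sample
preds <- c("diseased", "healthy", "diseased")

## Majority vote: the most frequent label wins
names(which.max(table(preds)))  # "diseased"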

Page 4

The Random Forest Algorithm


Page 5

Differences to standard tree

Train each tree on a bootstrap resample of the data (bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement; i.e., some samples will probably occur multiple times in the new data set)

For each split, consider only m randomly selected variables

Don’t prune

Fit B trees in this way and aggregate the results by averaging (regression) or majority vote (classification); see the R sketch below

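A minimal sketch of these choices in R, using the randomForest package named at the end of these slides; the data frame patients and its response status are hypothetical:

library(randomForest)

## Hypothetical data frame `patients` with factor response `status`;
## ntree corresponds to B above, mtry to m (variables tried per split).
## Each tree is grown on a bootstrap resample and is not pruned.
fit <- randomForest(status ~ ., data = patients, ntree = 500, mtry = 2)

## Aggregation: majority vote over the 500 trees
predict(fit, newdata = patients[1, ])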

Page 6

Why Random Forest works 1/2

Mean Squared Error = Variance + Bias²

If trees are sufficiently deep, they have very small bias

How could we reduce the variance below that of a single tree?


Page 7

Why Random Forest works 2/2


If the B trees are identically distributed, each with variance σ² and with pairwise correlation ρ, the variance of their average prediction at a fixed input x is

ρσ² + ((1 − ρ)/B) σ²

The second term decreases as the number of trees B increases (irrespective of ρ).

The first term decreases if ρ decreases, i.e., if m decreases.

De-correlation gives better accuracy.
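How the formula above arises (a standard calculation, written out here in LaTeX; it is reconstructed, not copied from the slide; T_1, …, T_B denote the tree predictions at a fixed input, identically distributed with variance σ² and pairwise correlation ρ):

\[
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right)
  = \frac{1}{B^2}\left(\sum_{b=1}^{B}\operatorname{Var}(T_b)
    + \sum_{b \neq b'}\operatorname{Cov}(T_b, T_{b'})\right)
  = \frac{1}{B^2}\left(B\sigma^2 + B(B-1)\,\rho\sigma^2\right)
  = \rho\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
\]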

Page 8

Estimating generalization error: Out-of-bag (OOB) error

Similar to leave-one-out cross-validation, but with almost no additional computational burden


Data:
old, tall – healthy
old, short – diseased
young, tall – healthy
young, short – healthy
young, short – diseased
young, tall – healthy
old, short – diseased

Resampled data:
old, tall – healthy
old, tall – healthy
old, short – diseased
old, short – diseased
young, tall – healthy
young, tall – healthy
young, short – healthy

[Figure: tree fitted to the resampled data, splitting on age (young/old) and height (short/tall), with leaves labeled diseased or healthy.]

Out-of-bag samples:
young, short – diseased
young, tall – healthy
old, short – diseased

Out-of-bag (OOB) error rate: 1/3 ≈ 0.33
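In R, the randomForest package reports the OOB error with no extra computation (a minimal sketch; the data frame patients is again hypothetical):

library(randomForest)

fit <- randomForest(status ~ ., data = patients, ntree = 500)

## `err.rate` has one row per tree; column "OOB" is the running
## OOB error estimate, so the last row gives the final estimate.
fit$err.rate[fit$ntree, "OOB"]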

Page 9

Variable importance for variable i using permutations


[Diagram: from the data, draw bootstrap resamples 1, …, m; resample j yields tree j and an out-of-bag (OOB) data set j.]

For each tree j = 1, …, m:
compute the OOB error e_j;
permute the values of variable i in OOB data set j and recompute the OOB error p_j;
set d_j = p_j − e_j.

The importance of variable i is the standardized mean difference:

d̄ = (1/m) Σ_j d_j
s_d² = (1/(m − 1)) Σ_j (d_j − d̄)²
v_i = d̄ / s_d
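This permutation measure is what randomForest reports as "mean decrease in accuracy" when fitted with importance = TRUE; scale = TRUE applies the division by s_d from the formula above (a sketch; patients is hypothetical):

library(randomForest)

fit <- randomForest(status ~ ., data = patients,
                    ntree = 500, importance = TRUE)

## Permutation importance v_i for every variable
importance(fit, type = 1, scale = TRUE)
varImpPlot(fit)  # plots both importance measures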

Page 10

Trees vs. Random Forest

Trees:
+ yield insight into decision rules
+ rather fast
+ parameters are easy to tune
- predictions tend to have high variance

Random Forest:
+ smaller prediction variance, and therefore usually better overall performance
+ parameters are easy to tune
- rather slow
- "black box": rather hard to get insight into the decision rules

Page 11

Comparing runtime (just for illustration)

[Figure: runtime of RF vs. a single tree; for the RF curve, the first predictor was cut into 15 levels.]

• Up to "thousands" of variables
• Problematic if there are categorical predictors with many levels (max: 32 levels)

Page 12

RF vs. LDA

LDA:
+ Very fast
+ Discriminants for visualizing group separation
+ Decision rule can be read off
- Can model only linear class boundaries
- Mediocre performance
- No variable selection
- Only for categorical responses
- Needs CV for estimating the prediction error

RF:
+ Can model nonlinear class boundaries
+ OOB error "for free" (no CV needed)
+ Works for continuous and categorical responses (regression / classification)
+ Gives variable importance
+ Very good performance
- "Black box"
- Slow


Page 13

Concepts to know

Idea of Random Forest and how it reduces the prediction variance of trees

OOB error

Variable Importance based on Permutation


Page 14

R functions to know

Functions “randomForest” and “varImpPlot” from package “randomForest”

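A worked end-to-end example on R's built-in iris data tying these functions together (the data set is not from the slides; it just makes the sketch runnable as-is):

library(randomForest)

set.seed(1)  # bootstrap resampling is random
fit <- randomForest(Species ~ ., data = iris,
                    ntree = 500, importance = TRUE)

print(fit)       # confusion matrix and OOB error estimate
varImpPlot(fit)  # importance of the four predictors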