New Random Forest · 2020. 7. 26.
Random Forest
Applied Multivariate Statistics – Spring 2013
Overview
Intuition of Random Forest
The Random Forest Algorithm
De-correlation gives better accuracy
Out-of-bag error (OOB-error)
Variable importance
Intuition of Random Forest
[Figure: three decision trees, each grown on a different resample of the data.
Tree 1 splits on age (young/old) and height (short/tall);
Tree 2 splits on age (young/old) and sex (female/male);
Tree 3 splits on employment (working/retired) and height (short/tall);
leaves are labeled healthy or diseased.]

New sample: old, retired, male, short
Tree predictions: diseased, healthy, diseased
Majority rule: diseased
The Random Forest Algorithm

Differences to a standard tree:
- Train each tree on a bootstrap resample of the data (bootstrap resample of a data set with N samples: draw N samples with replacement to form a new data set; some samples will likely occur multiple times in it)
- For each split, consider only m randomly selected variables
- Don't prune
- Fit B trees in this way and aggregate their results by averaging or majority voting
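A minimal sketch of this recipe, assuming scikit-learn's RandomForestClassifier as the implementation (the lecture itself uses R's randomForest package; the dataset here is synthetic):

```python
# Sketch: the Random Forest recipe with scikit-learn (assumed available).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

B = 100   # number of trees, each fit on a bootstrap resample
m = 3     # variables considered at each split

rf = RandomForestClassifier(
    n_estimators=B,    # fit B trees
    max_features=m,    # m randomly selected variables per split
    bootstrap=True,    # each tree sees a bootstrap resample of the data
    max_depth=None,    # grow trees deep; no pruning
    random_state=0,
)
rf.fit(X, y)
print(rf.predict(X[:5]))  # majority vote over the B trees
```

For regression, RandomForestRegressor aggregates by averaging the trees' predictions instead of voting.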
Why Random Forest works 1/2

Mean Squared Error = Variance + Bias²

If trees are sufficiently deep, they have very small bias.
How could we improve the variance over that of a single tree?
Why Random Forest works 2/2

Variance of the average of B identically distributed trees, each with variance σ² and pairwise correlation ρ:

Var(average) = ρσ² + ((1 − ρ)/B)·σ²

- The second term decreases if the number of trees B increases (irrespective of ρ)
- The first term decreases if ρ decreases, i.e., if m decreases

De-correlation gives better accuracy.
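The behavior of this variance formula can be checked numerically; avg_variance is a hypothetical helper name for illustration:

```python
# Var(average of B trees) = rho*sigma^2 + (1 - rho)/B * sigma^2
def avg_variance(rho, sigma2, B):
    return rho * sigma2 + (1.0 - rho) / B * sigma2

sigma2 = 1.0
# More trees: variance shrinks toward the floor rho*sigma^2, never below it
print(avg_variance(0.5, sigma2, 10))
print(avg_variance(0.5, sigma2, 1000))
# Smaller rho (smaller m): the floor rho*sigma^2 itself drops
print(avg_variance(0.1, sigma2, 1000))
```

Adding trees only removes the second term; lowering the correlation ρ (by choosing a smaller m) is what lowers the floor.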
Estimating generalization error:
Out-of-bag (OOB) error

Similar to leave-one-out cross-validation, but with almost no additional computational burden
Data:
old, tall – healthy
old, short – diseased
young, tall – healthy
young, short – healthy
young, short – diseased
young, tall – healthy
old, short – diseased

Resampled data:
old, tall – healthy
old, tall – healthy
old, short – diseased
old, short – diseased
young, tall – healthy
young, tall – healthy
young, short – healthy

Out-of-bag samples:
young, short – diseased
young, tall – healthy
old, short – diseased

[Figure: tree fit to the resampled data, splitting on age (young/old) and height (short/tall)]

Out-of-bag (OOB) error rate: 1/3 ≈ 0.33
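The same idea at larger scale — a sketch assuming scikit-learn is available, where oob_score=True makes each tree evaluate the samples left out of its bootstrap resample and stores the resulting OOB accuracy as oob_score_:

```python
# OOB error "for free": no separate cross-validation loop is needed.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            bootstrap=True, random_state=0)
rf.fit(X, y)
print("OOB error:", 1 - rf.oob_score_)  # estimate of generalization error
```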
Variable Importance for variable i using Permutations

Data → Resampled Dataset 1, …, Resampled Dataset m (leaving out OOB Data 1, …, OOB Data m)
Fit Tree 1, …, Tree m → OOB errors e_1, …, e_m
Permute the values of variable i in each OOB data set, predict again → OOB errors p_1, …, p_m

d_t = p_t − e_t,  t = 1, …, m
d̄ = (1/m) Σ_{t=1}^{m} d_t
s_d² = (1/(m−1)) Σ_{t=1}^{m} (d_t − d̄)²
Importance: v_i = d̄ / s_d
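Given the per-tree OOB errors e_t and permuted-OOB errors p_t, the score v_i is a short computation; the error values below are made up for illustration:

```python
import math

e = [0.10, 0.12, 0.08, 0.11]   # OOB error of each tree
p = [0.20, 0.25, 0.15, 0.22]   # OOB error after permuting variable i
m = len(e)

d = [p_t - e_t for p_t, e_t in zip(p, e)]
d_bar = sum(d) / m                                     # mean increase in error
s2_d = sum((d_t - d_bar) ** 2 for d_t in d) / (m - 1)  # sample variance of d
v_i = d_bar / math.sqrt(s2_d)                          # importance of variable i
print(v_i)  # ≈ 4.1 for these numbers
```

A large v_i means permuting variable i consistently hurts the OOB predictions, i.e., the forest relies on that variable.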
Trees vs. Random Forest

Trees:
+ Yield insight into decision rules
+ Rather fast
+ Easy to tune parameters
- Predictions of trees tend to have high variance

Random Forest:
+ Smaller prediction variance and therefore usually better overall performance
+ Easy to tune parameters
- Rather slow
- "Black box": rather hard to get insight into the decision rules
Comparing runtime (just for illustration)

[Figure: runtime of RF vs. a single tree; for RF, the first predictor is cut into 15 levels]

- Up to "thousands" of variables
- Problematic if there are categorical predictors with many levels (max: 32 levels)
RF vs. LDA

LDA:
+ Very fast
+ Discriminants for visualizing group separation
+ Can read off the decision rule
- Can model only linear class boundaries
- Mediocre performance
- No variable selection
- Only for categorical response
- Needs CV for estimating prediction error

Random Forest:
+ Can model nonlinear class boundaries
+ OOB error "for free" (no CV needed)
+ Works on continuous and categorical responses (regression / classification)
+ Gives variable importance
+ Very good performance
- "Black box"
- Slow
Concepts to know

- Idea of Random Forest and how it reduces the prediction variance of trees
- OOB error
- Variable importance based on permutations
R functions to know

Functions "randomForest" and "varImpPlot" from package "randomForest"