
BSAN 400 Introduction to Machine Learning

Lecture 5. Tree-Based Methods [1]

Shaobo Li

University of Kansas

[1] Partially based on Hastie et al. (2009), The Elements of Statistical Learning (ESL), and James et al. (2013), An Introduction to Statistical Learning (ISLR).


Overview

Supervised learning

Simple to interpret

Nonparametric method

We will talk about

Classification and Regression Trees (CART)

Bagging, Random Forest, and Boosting


A Simple Example of Regression Tree

Predicting a baseball player's log-salary using:
- X1: number of years he has played in the major leagues
- X2: number of hits he made in the previous year


Prediction

The prediction is the value in the terminal node

Traverse the tree top-down

At each split, go left if the condition is TRUE


Data Partition - Stratification

Figure: Salary is color-coded from low (blue, green) to high (yellow, red)


Data Partition - Stratification


CART

Classification and regression trees

Introduced by Breiman et al. (1984).

Binary splitting for variable j at point s.

Top-down, greedy algorithm
- Start from the top of the tree
- Each split is optimal only at the current step

Question:

How to determine the optimal splitting point?

What criteria should we use?

When to stop growing the tree?


Regression Tree

For each split, we determine two regions R1 and R2 by finding the optimal splitting variable Xj and split point s, that is

$$R_1 = \{X \mid X_j \le s\} \quad \text{and} \quad R_2 = \{X \mid X_j > s\}$$

This is done by scanning all possible split points for each predictor X.

The optimal split point minimizes the residual sum of squares

$$\sum_{i \in R_1} (y_i - \bar{y}_{R_1})^2 + \sum_{i \in R_2} (y_i - \bar{y}_{R_2})^2,$$

where $\bar{y}_{R_1}$ and $\bar{y}_{R_2}$ are the averages of Y in each region.

Repeat this process within each region until the decrease in RSS is no longer significant.
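To make the greedy search concrete, here is a minimal sketch in R (not from the slides) of the best-split search for a single numeric predictor; the function name and the simulated data are purely illustrative.

```r
# Greedy split search for one numeric predictor x: for every candidate split
# point s, compute the RSS of the two resulting regions and keep the s with
# the smallest total RSS. A full CART implementation repeats this over all
# predictors and picks the best (j, s) pair.
best_split <- function(x, y) {
  best <- list(s = NA, rss = Inf)
  for (s in sort(unique(x))) {
    left  <- y[x <= s]
    right <- y[x > s]
    if (length(left) == 0 || length(right) == 0) next
    rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (rss < best$rss) best <- list(s = s, rss = rss)
  }
  best
}

# Toy check with simulated data: the chosen split should be near 0.4
set.seed(1)
x <- runif(100)
y <- ifelse(x < 0.4, 1, 3) + rnorm(100, sd = 0.3)
best_split(x, y)
```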


Stopping Strategy - Cost-Complexity Pruning

To avoid overfitting!

Suppose T0 is a very large (overfitted) tree. Denote by T a subtree of T0, and by |T| the size of that tree (the number of terminal nodes). We minimize the cost-complexity criterion

$$C_\alpha(T) = \sum_{m=1}^{|T|} \mathrm{RSS}_m + \alpha |T|,$$

where $\mathrm{RSS}_m$ is the residual sum of squares in terminal region m.

A larger α penalizes the tree size |T| more
- Recall how we use AIC/BIC for variable selection
- Every α corresponds to a unique optimal subtree Tα

We use cross-validation to choose the optimal α.


Pruning a Tree in R

cp is the complexity parameter α we have discussed.
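Below is a hedged sketch of the pruning workflow using the rpart package. The Hitters data (from the ISLR package) and the tuning values are assumptions chosen to match the earlier baseball example, not code taken from the slide; note that rpart's cp is a rescaled version of α (relative to the error of the root tree).

```r
# Grow a deliberately large regression tree (small cp), then prune it back
# using the cp value with the smallest cross-validated error.
library(rpart)
library(ISLR)                         # assumed source of the Hitters data

hitters <- na.omit(Hitters)
fit <- rpart(log(Salary) ~ Years + Hits, data = hitters,
             method = "anova", cp = 0.001)

printcp(fit)                          # cp table with cross-validated error (xerror)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)   # cost-complexity pruning at the chosen cp
```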


Classification Tree

The outcome is a categorical variable

The same idea as for regression trees

The only difference is the splitting criterion

What should we use as the criterion?


Splitting Criteria for Classification Tree

Misclassification rate

Gini index

$$\text{Gini index} = \sum_{k \ne k'} p_k p_{k'} = \sum_{k=1}^{K} p_k (1 - p_k)$$

where $p_k$ is the proportion of category-k observations in the node.

Entropy

$$\text{Entropy} = -\sum_{k=1}^{K} p_k \log p_k$$

For each split, the Gini index (or entropy) is calculated for the two resulting regions, and their weighted average is the criterion to be minimized.
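As a small illustration, the impurity measures and the weighted-average criterion can be written directly in R; these helper functions are a sketch for illustration, not from the slides.

```r
# Node impurity as a function of the class proportions p_k
gini    <- function(p) sum(p * (1 - p))
entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))   # treat 0*log(0) as 0

# Weighted-average impurity of a split into two child nodes
split_impurity <- function(n_left, p_left, n_right, p_right, impurity = gini) {
  n <- n_left + n_right
  (n_left / n) * impurity(p_left) + (n_right / n) * impurity(p_right)
}
```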


Illustration


Example

Suppose a node contains a binary outcome (0 or 1) with 400 observations of each class.

Split strategy 1: (300, 100) and (100, 300)

Split strategy 2: (400, 200) and (0, 200)

What are the misclassification rate, Gini index, and entropy for each split strategy?
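A quick numeric check of the exercise, assuming each pair above lists the class counts in a child node. Both strategies give the same misclassification rate, but Gini and entropy favor strategy 2 because it produces a pure node.

```r
# Weighted misclassification rate, Gini index, and entropy for a split,
# given the class counts in each child node.
impurities <- function(children) {
  n <- sapply(children, sum)
  w <- n / sum(n)                                   # child-node weights
  p <- lapply(children, function(cnt) cnt / sum(cnt))
  c(misclass = sum(w * sapply(p, function(pp) 1 - max(pp))),
    gini     = sum(w * sapply(p, function(pp) sum(pp * (1 - pp)))),
    entropy  = sum(w * sapply(p, function(pp) -sum(ifelse(pp > 0, pp * log(pp), 0)))))
}

impurities(list(c(300, 100), c(100, 300)))   # split strategy 1
impurities(list(c(400, 200), c(0, 200)))     # split strategy 2
```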


Categorical Predictor

If a categorical predictor has q possible values, how many possible splits are there? (There are 2^{q-1} - 1 possible binary partitions of the q categories.)

Computationally infeasible for large q.

Solution for a binary outcome (see the sketch below):
- Sort the q classes according to the proportion of "1" in each of the q classes.
- Split the ordered variable as if it were a numeric variable.
- This split is optimal in terms of the Gini index and entropy.

But we should avoid categorical variables that have too many categories. Why?
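A minimal sketch of the ordering trick in R; the function and variable names are made up for illustration.

```r
# Order the categories of a factor x by the proportion of 1s in a binary
# outcome y, so the predictor can then be split as if it were numeric.
order_categories <- function(x, y) {
  p1 <- tapply(y, x, mean)                      # proportion of 1s per category
  factor(x, levels = names(sort(p1)), ordered = TRUE)
}
```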


Asymmetric Cost

Similar to the idea we discussed in the last lecture

Define a loss matrix with different weights on the misclassified cells.

            Pred=1   Pred=0
  True=1      0        w0
  True=0      w1       0

Loan application example (1=default): w1 < w0

The weights are incorporated into the splitting criterion at each split
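One way to do this in R is rpart's loss matrix argument; the sketch below uses simulated loan data and illustrative weights, and the variable names are assumptions, not from the slides.

```r
# Asymmetric misclassification costs via rpart's loss matrix.
# Rows of the loss matrix are the true class, columns the predicted class
# (in the order of the factor levels), with zeros on the diagonal.
library(rpart)

w0 <- 5   # cost of predicting 0 when the truth is 1 (missed default)
w1 <- 1   # cost of predicting 1 when the truth is 0 (false alarm)
L  <- matrix(c(0,  w1,    # true = 0: cost of predicting 0, predicting 1
               w0,  0),   # true = 1: cost of predicting 0, predicting 1
             nrow = 2, byrow = TRUE)

set.seed(1)
loans <- data.frame(default = factor(rbinom(500, 1, 0.2)),
                    income  = rnorm(500),
                    debt    = rnorm(500))
fit <- rpart(default ~ income + debt, data = loans, method = "class",
             parms = list(loss = L))
```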


Bagging

Limitations of a single tree: high variance, unstable
- A small change in the sample can lead to a different split.
- The current split heavily relies on previous splits.

Bagging = Bootstrap aggregation

$$\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)$$

- Leo Breiman (1996)
- Fit many trees on bootstrap samples, then take the average.
- Variance can be significantly reduced.

For a classification tree, how do we take the average?
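A minimal bagging sketch in R with rpart as the base learner; the simulated data, B, and the function names are illustrative assumptions.

```r
# Bagging: fit B regression trees on bootstrap samples and average their
# predictions. For classification, replace the mean with a majority vote
# (or average the predicted class probabilities).
library(rpart)

set.seed(1)
n <- 200
train <- data.frame(x1 = runif(n), x2 = runif(n))
train$y <- 2 * sin(2 * pi * train$x1) + train$x2 + rnorm(n, sd = 0.3)

B <- 100
trees <- lapply(1:B, function(b) {
  idx <- sample(n, replace = TRUE)                 # bootstrap sample
  rpart(y ~ x1 + x2, data = train[idx, ])
})

# Bagged prediction = average of the B trees' predictions
predict_bag <- function(newdata) {
  rowMeans(sapply(trees, predict, newdata = newdata))
}
head(predict_bag(train))
```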


Out-of-bag (OOB) Prediction

Similar to cross-validation

A type of testing error, used to prevent overfitting

Each tree is fit on a bootstrap sample, which serves as its training sample; the observations left out of that sample (the out-of-bag observations) serve as its testing sample. The OOB error is the average error over these testing samples.
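With the randomForest package, OOB predictions come essentially for free. A small sketch, reusing the simulated training data from the bagging example above; setting mtry equal to the number of predictors makes the forest a plain bagged ensemble.

```r
# predict() on a fitted randomForest object with no newdata returns, for each
# training observation, the prediction averaged over only the trees whose
# bootstrap sample did not contain that observation (its OOB prediction).
library(randomForest)

bag <- randomForest(y ~ x1 + x2, data = train, mtry = 2, ntree = 500)
oob_pred <- predict(bag)                     # out-of-bag predictions
oob_mse  <- mean((train$y - oob_pred)^2)     # OOB estimate of test error
```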


Random Forests

An improvement on bagging
- The trees in bagging are correlated with each other
- This may add variance and prediction error

Goal: decorrelate the trees while maintaining their strength

How RF works: at each split, randomly choose m predictors as the candidate splitting variables

Choice of m
- A smaller m decreases the correlation, but also decreases strength
- A larger m increases the correlation, but also increases strength
- Use the OOB error to determine the optimal m (see the sketch below)
- By default, m ≈ √p for classification and p/3 for regression

See the inventor's (Leo Breiman) website for more details.
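A hedged sketch of choosing m (mtry in randomForest) by comparing OOB errors; the simulated data and the number of trees are illustrative, not recommendations.

```r
# Compare the OOB mean squared error across values of mtry (the m above)
# and pick the value with the smallest OOB error.
library(randomForest)

set.seed(2)
p <- 10; n <- 300
X <- as.data.frame(matrix(runif(n * p), n, p))
y <- 3 * X$V1 - 2 * X$V2 + rnorm(n, sd = 0.5)

oob_error <- sapply(1:p, function(m) {
  rf <- randomForest(x = X, y = y, mtry = m, ntree = 300)
  tail(rf$mse, 1)                 # OOB MSE after the last tree
})
which.min(oob_error)              # mtry with the smallest OOB error
```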


Illustration of De-correlation


Change of Testing Error across m


Gradient Boosting

Ensemble method

Based on small trees (weak learner)

Additive: adding up all the small trees

Pros:

Powerful algorithm, high prediction accuracy
Small variance (an advantage of ensemble methods)

Cons:

Computationally expensive for large data
Lack of interpretability
May overfit the data


Boosting Regression Tree

Additive: trees are grown sequentially

Combination of many small trees

Boosting for regression tree

1. Set f(x) = 0 and ri = yi for all i.
2. For b = 1, . . . , B, repeat:
   - Fit a small tree f^(b)(x) with d splits to the training sample, treating the current residuals r as the response.
   - Update the residuals: ri ← ri − λ f^(b)(xi), where λ is the shrinkage parameter.
   - Update the fit: f(x) ← f(x) + λ f^(b)(x).
3. Output the boosted regression tree

$$f(x) = \sum_{b=1}^{B} \lambda f^{(b)}(x)$$
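A direct translation of this algorithm into R, using small rpart trees as the weak learners; the simulated data and the values of B, λ, and d are illustrative assumptions.

```r
# Boosting for a regression tree: repeatedly fit a shallow tree to the current
# residuals, add a shrunken copy of it to the fit, and update the residuals.
library(rpart)

set.seed(3)
n <- 300
dat <- data.frame(x = runif(n))
dat$y <- sin(4 * dat$x) + rnorm(n, sd = 0.2)

B <- 500; lambda <- 0.01; d <- 2      # number of trees, shrinkage, tree depth
dat$r <- dat$y                        # step 1: f(x) = 0, r_i = y_i
f_hat <- rep(0, n)
trees <- vector("list", B)

for (b in 1:B) {                      # step 2
  fit_b <- rpart(r ~ x, data = dat,
                 control = rpart.control(maxdepth = d, cp = 0))
  trees[[b]] <- fit_b
  pred_b <- predict(fit_b)            # f^(b)(x_i) on the training sample
  f_hat  <- f_hat + lambda * pred_b   # f(x) <- f(x) + lambda * f^(b)(x)
  dat$r  <- dat$r - lambda * pred_b   # r_i <- r_i - lambda * f^(b)(x_i)
}

mean((dat$y - f_hat)^2)               # step 3: training MSE of the boosted fit
```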


Gradient Boosting Machine (GBM)

A more general version.

A user-supplied differentiable loss function, e.g. squared loss

$$\sum_{i=1}^{n} L(y_i, f(x_i)) = \sum_{i=1}^{n} (y_i - f(x_i))^2$$

The negative gradient of each L(yi, f(xi)) is essentially the residual (or pseudo-residual).

Then fit a new regression tree to approximate the negative gradient (i.e., fit the residuals).

For binary classification, the loss can be the binomial deviance (equivalent to cross-entropy, see ESL p. 346), but the weak learner in each iteration is still a regression tree that approximates the negative gradient of the deviance, (yi − pi).
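In practice this is what the gbm package implements; a hedged sketch, reusing the simulated data from the hand-rolled boosting loop above (the distribution, tree depth, and shrinkage values are illustrative assumptions).

```r
# Gradient boosting with squared-error loss ("gaussian"); for a binary
# outcome one would use distribution = "bernoulli" (binomial deviance).
library(gbm)

boost <- gbm(y ~ x, data = dat, distribution = "gaussian",
             n.trees = 500, interaction.depth = 2,
             shrinkage = 0.01, cv.folds = 5)

best_B <- gbm.perf(boost, method = "cv")            # CV-chosen number of trees
pred   <- predict(boost, newdata = dat, n.trees = best_B)
```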


The algorithm (ESL p387)


AdaBoost - Boosting for Classification

Originally designed for classification with Y ∈ {−1, 1}

Algorithm:

1. Assign equal weights to all observations: wi = 1/N.
2. For b = 1, . . . , B, repeat:
   - Fit a classifier Gb(x) to the training sample using the weights wi.
   - Compute the weighted misclassification error

     $$\mathrm{err}_b = \frac{\sum_{i=1}^{N} w_i I(y_i \ne G_b(x_i))}{\sum_{i=1}^{N} w_i}$$

   - Compute αb = log((1 − errb)/errb).
   - Update the weights wi ← wi exp{αb I(yi ≠ Gb(xi))}, and normalize so that they sum to 1.
3. Output

$$G(x) = \mathrm{sign}\Big\{\sum_{b=1}^{B} \alpha_b G_b(x)\Big\}$$

The StatQuest video on AdaBoost may help you understand the idea.
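A minimal AdaBoost.M1 sketch in R with depth-1 rpart trees (stumps) as the weak classifier; the simulated data and B are illustrative assumptions, and labels are coded as −1/+1 as in the slide.

```r
# AdaBoost: reweight observations after each round so that misclassified
# points get more weight, then combine the stumps by a weighted vote.
library(rpart)

set.seed(4)
n <- 400
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- factor(ifelse(dat$x1 + dat$x2 + rnorm(n, sd = 0.3) > 1, 1, -1))

B <- 100
w <- rep(1 / n, n)                        # step 1: equal weights
stumps <- vector("list", B)
alpha  <- numeric(B)

for (b in 1:B) {                          # step 2
  fit  <- rpart(y ~ x1 + x2, data = dat, weights = w, method = "class",
                control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  pred <- predict(fit, type = "class")
  miss <- as.numeric(pred != dat$y)
  err  <- sum(w * miss) / sum(w)          # weighted misclassification error
  alpha[b]    <- log((1 - err) / err)
  stumps[[b]] <- fit
  w <- w * exp(alpha[b] * miss)           # up-weight misclassified points
  w <- w / sum(w)                         # normalize so the weights sum to 1
}

# Step 3: final classifier is the sign of the alpha-weighted vote
adaboost_predict <- function(newdata) {
  votes <- sapply(seq_len(B), function(b)
    alpha[b] * as.numeric(as.character(predict(stumps[[b]], newdata, type = "class"))))
  sign(rowSums(votes))
}
mean(adaboost_predict(dat) == as.numeric(as.character(dat$y)))   # training accuracy
```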


Performance Comparison


XGBoost - Extreme Gradient Boosting

Scalable tree boosting

Developed by Tianqi Chen in 2016.

One of the best supervised learning algorithms (Machine Learning Challenge Winning Solutions)


General idea

We omit the mathematical details here. For more detail, please see the official tutorial and the original paper.

Similar to traditional gradient tree boosting.

Adds regularization, which prevents overfitting.

Optimizes the use of computing power.
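A hedged sketch with the xgboost R package; the simulated data and the tuning values (nrounds, eta, max_depth, lambda) are illustrative assumptions, not recommendations.

```r
# XGBoost for a regression problem: boosted trees with shrinkage (eta) and
# an explicit L2 regularization term (lambda) on the leaf weights.
library(xgboost)

set.seed(5)
n <- 500; p <- 5
X <- matrix(runif(n * p), n, p)
y <- 2 * X[, 1] - X[, 2]^2 + rnorm(n, sd = 0.3)

dtrain <- xgb.DMatrix(data = X, label = y)
fit <- xgb.train(params = list(objective = "reg:squarederror",
                               eta       = 0.1,   # shrinkage / learning rate
                               max_depth = 3,
                               lambda    = 1),    # L2 regularization
                 data = dtrain, nrounds = 200)
pred <- predict(fit, X)
```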
