You, H - Classification Trees and Random Forest
Classification Trees and Random Forest
Huaxin You
Department of Statistics and Actuarial Science
University of Central Florida
1
Outline
• Classification Problems.
• Classification Trees.
• Bagging.
• Boosting.
• Random Forest.
2
Classification Problems
Provided a training data set with known class membership, classification problems involve finding a classification rule, so that it has high predictive accuracy on future examples from the same classes.
Some popular problems include the following:
• Recognize handwritten digits and characters.
• Identify terror suspects in video of thousands of people.
• Classify land type (forest, field, desert, urban) from satellite images.
• Diagnose a patient's disease based on collected physiological data.
• Prevent credit card fraud or network intrusions.
3
An Example: Credit Risk Analysis
Data: snapshots of Customer 103's record over time (t0, t1, ..., tn):

(time = t0) Own House: Yes; Other delinquent accts: 2; Loan balance: $2,400; Income: $52k; Max billing cycles late: 3; Years of credit: 9; Profitable customer?: ?
(time = t1) Own House: Yes; Other delinquent accts: 2; Loan balance: $3,250; Income: ?; Max billing cycles late: 4; Years of credit: 9; Profitable customer?: ?
...
(time = tn) Own House: Yes; Other delinquent accts: 3; Loan balance: $4,500; Income: ?; Max billing cycles late: 6; Years of credit: 9; Profitable customer?: No
Rules learned from synthesized data:
If Other-Delinquent-Accounts > 2, and Number-Delinquent-Billing-Cycles > 1
Then Profitable-Customer? = No [Deny Credit Card application]

If Other-Delinquent-Accounts = 0, and (Income > $30k) OR (Years-of-Credit > 3)
Then Profitable-Customer? = Yes [Accept Credit Card application]
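The two learned rules can be written as a small function (a sketch; the function and argument names are illustrative, and a real system would consult further rules when neither of these fires):

```python
def profitable_customer(other_delinquent_accts, delinquent_billing_cycles,
                        income, years_of_credit):
    """Apply the two learned rules; True = accept, False = deny, None = no rule fires."""
    if other_delinquent_accts > 2 and delinquent_billing_cycles > 1:
        return False  # deny credit card application
    if other_delinquent_accts == 0 and (income > 30_000 or years_of_credit > 3):
        return True   # accept credit card application
    return None       # neither rule applies
```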
4
An Example of Classification Trees
Should a tennis match be played?
Outlook?
├─ Sunny → Humidity?
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind?
    ├─ Strong → No
    └─ Weak → Yes
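The tree on this slide translates directly into code (attribute values written as strings, following the slide):

```python
def play_tennis(outlook, humidity, wind):
    """The Outlook/Humidity/Wind decision tree, as nested conditionals."""
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Rain":
        return "No" if wind == "Strong" else "Yes"
    raise ValueError(f"unknown outlook: {outlook}")
```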
5
Classification Trees
There are many algorithms proposed to generate classification trees, for example ID3 and C4.5 (R. Quinlan (1993)) and CART (L. Breiman et al. (1984)). They all focus on constructing a tree-like classification rule based on a given training set.

These methods partition the space recursively, so that the impurity is reduced gradually. The impurity can be measured by:

Ψ = −∑_{i=1}^{k} p_i log p_i,

where Ψ is also called Entropy, and the p_i's denote the proportion of examples from the i-th class. Note that if, in a subspace, all examples are from the same class, Ψ = 0, whereas Ψ is maximized when examples come evenly from all classes.
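The entropy measure computes directly (a sketch; the natural log is used to match the formula above, though base 2 is also common in practice):

```python
import math

def entropy(proportions):
    """Impurity Ψ = -sum(p_i * log p_i); terms with p_i = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in proportions if p > 0)

entropy([1.0])        # pure node: Ψ = 0
entropy([0.5, 0.5])   # evenly mixed two-class node: Ψ = log 2, the maximum
```

As the slide notes, a pure node gives Ψ = 0 and an even mix over all classes maximizes Ψ.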
6
Top-Down Induction of Decision Trees
Main loop:
1. Find the "best" decision attribute A for the next node, such that the impurity is reduced most.
2. Assign A as the decision attribute for node.
3. For each value or partition of A, create a new descendant of node.
4. Sort training examples to leaf nodes.
5. If training examples are perfectly classified, then STOP; else iterate over new leaf nodes.
Which attribute is best?
[Figure: a node with 64 examples, [29+, 35−], split two ways]
A1 = t: [21+, 5−]   A1 = f: [8+, 30−]
A2 = t: [18+, 33−]  A2 = f: [11+, 2−]
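The comparison can be checked numerically. This sketch uses base-2 entropy (a common convention for information gain; the choice of base does not change which attribute wins):

```python
import math

def entropy2(pos, neg):
    """Base-2 entropy of a node holding pos positive and neg negative examples."""
    h, total = 0.0, pos + neg
    for c in (pos, neg):
        if c:
            h -= (c / total) * math.log2(c / total)
    return h

def gain(parent, children):
    """Information gain: parent entropy minus size-weighted child entropy."""
    n = sum(p + q for p, q in children)
    return entropy2(*parent) - sum((p + q) / n * entropy2(p, q)
                                   for p, q in children)

gain_a1 = gain((29, 35), [(21, 5), (8, 30)])   # A1: t -> [21+,5-], f -> [8+,30-]
gain_a2 = gain((29, 35), [(18, 33), (11, 2)])  # A2: t -> [18+,33-], f -> [11+,2-]
# gain_a1 ≈ 0.27 > gain_a2 ≈ 0.12, so A1 reduces impurity more and is "best"
```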
7
Bagging Trees
Bagging: train classification trees on bootstrapped samples (L. Breiman (1998)).

• Given a training set of size n (big bag), create m different training sets (small bags) by sampling from the original data with replacement.
• Build m classification trees by training a classification tree algorithm on these m training sets.
• Aggregate the predictions by simple majority vote.
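The three steps can be sketched in plain Python. Depth-1 threshold stumps stand in for full classification trees here to keep the sketch short, and the toy dataset is made up for illustration:

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a depth-1 'tree': the threshold on x that best separates the classes."""
    best = None  # (error_count, threshold, label_right)
    for x, _ in sample:
        for label_right in (0, 1):
            err = sum((label_right if xi >= x else 1 - label_right) != y
                      for xi, y in sample)
            if best is None or err < best[0]:
                best = (err, x, label_right)
    _, thr, label_right = best
    return lambda x: label_right if x >= thr else 1 - label_right

def bag(data, m, seed=0):
    """Bagging: m bootstrap samples (small bags), one stump per bag, majority vote."""
    rng = random.Random(seed)
    n = len(data)
    stumps = [train_stump([rng.choice(data) for _ in range(n)]) for _ in range(m)]
    return lambda x: Counter(s(x) for s in stumps).most_common(1)[0][0]

data = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1)]  # (x, class) toy set
ensemble = bag(data, m=25)
```

Each small bag may look quite different from the big bag, so individual stumps vary; the majority vote smooths that variation out.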
8
Boosting Trees
Boosting: train classification trees on sequentially reweighted versions of the training data set (R. Schapire et al. (1997) and J. Friedman et al. (1998)).

• Train the first classification tree.
• Data points are given different weights. A new classification tree is trained to focus on the data points the previous classification tree got wrong.
• During testing, each classification tree gets a weighted vote proportional to its accuracy on the training data.
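A minimal sketch of the reweighting idea, in the style of AdaBoost (the specific weight-update and vote formulas below are AdaBoost's; the stump pool and toy data are made up for illustration):

```python
import math

def boost(data, stump_pool, rounds):
    """Sequentially reweight examples (labels in {-1, +1}): each round picks the
    stump with the lowest *weighted* error, then up-weights its mistakes."""
    n = len(data)
    w = [1.0 / n] * n
    ensemble = []  # (vote_weight, stump)
    for _ in range(rounds):
        errs = [sum(wi for wi, (x, y) in zip(w, data) if s(x) != y)
                for s in stump_pool]
        err, s = min(zip(errs, stump_pool), key=lambda pair: pair[0])
        err = max(err, 1e-12)                    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # more accurate => bigger vote
        ensemble.append((alpha, s))
        w = [wi * math.exp(alpha if s(x) != y else -alpha)
             for wi, (x, y) in zip(w, data)]
        total = sum(w)
        w = [wi / total for wi in w]             # renormalize the weights
    return lambda x: 1 if sum(a * s(x) for a, s in ensemble) >= 0 else -1

data = [(-2, -1), (-1, -1), (1, 1), (2, 1)]               # (x, label)
pool = [lambda x, t=t: 1 if x >= t else -1 for t in (-1.5, 0.0, 1.5)]
clf = boost(data, pool, rounds=3)
```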
9
Research Results
Compared to a single tree,
• Bagging consistently provides a modest gain.
• Boosting generally provides a larger improvement than Bagging.
• Both Bagging and Boosting increase the training cost.

A = {portion of population misclassified by t1}, B = {portion of population misclassified by t2}, C = {portion of population misclassified by t3}.
10
Bagging Trees and Variance Reduction
Classification trees are very sensitive to small changes in the examples. Two similar samples taken from the same population can result in two very different classification trees; in statistical terminology, trees have high variance. Therefore, voting many trees constructed from many small bags of examples reduces the variance and stabilizes the performance of the trees.

Note that Bagging mainly reduces the variance of a classification method. If the method is intrinsically flawed (i.e., has large bias), Bagging won't work.
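The variance-reduction argument can be illustrated with a toy simulation: if each tree independently classifies a point correctly with probability 0.65, a majority vote over many such trees is right far more often. (This independence is an idealization; real bagged trees are correlated, which weakens the effect.)

```python
import random

rng = random.Random(42)
p_single, n_trees, n_points = 0.65, 101, 2000

votes_correct = 0
for _ in range(n_points):
    # number of trees that classify this point correctly
    correct_trees = sum(rng.random() < p_single for _ in range(n_trees))
    votes_correct += correct_trees > n_trees // 2  # majority vote is right

vote_accuracy = votes_correct / n_points  # far above the single-tree 0.65
```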
11
Boosting Trees and Bayes Rule
Ideally, a random forest F tries to minimize the generalization error:

P(Error) = E[ I(y(x) ≠ F(x)) ],

where y(x) and F(x) are the actual class and predicted class of x, respectively. Given only a training set (x_i, y_i), i = 1, ..., n,

P(Error | S) = (1/n) ∑_{i=1}^{n} I[ y(x_i) ≠ F(x_i) ].

In practice, P(Error | S) is not continuous and is hard to minimize. What Boosting does is maximize the expected value of the margins M_i. The function M(·) is usually continuous and easier to maximize. M(·) and P(Error) have the same population optimizer, the Bayes Rule:

Assign x to class i if: p(i|x) > p(j|x), ∀ j ≠ i.
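The Bayes Rule is a one-liner in code (the posterior values below are made-up numbers for illustration):

```python
def bayes_assign(posteriors):
    """Bayes rule: assign x to the class i with the largest posterior p(i|x)."""
    return max(posteriors, key=posteriors.get)

posteriors = {"class_1": 0.2, "class_2": 0.7, "class_3": 0.1}  # p(i|x) for one x
```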
12
Summary: Random Forests
What is a random forest?
• A huge ensemble of trees generated in some fashion.
• The decision rule is produced by voting.

Computational cost concerns:

• The training of many trees is very expensive for large or high-dimensional datasets.
• The storage of many trees can be an issue.

Remedies:

• When splitting nodes, instead of choosing the optimal split among all variables, one only chooses the optimal split among a randomly selected subset of variables.
• Construct short trees; avoid deep trees that require exponentially more searching.
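The first remedy looks like this at a single node (a sketch; the `score` function and toy rows are illustrative stand-ins for whatever split-scoring criterion the tree algorithm uses):

```python
import random

def choose_split(node_data, n_features, k, score, rng):
    """Random-forest remedy: score only k randomly chosen features at this
    node, not all n_features, and split on the best of those."""
    candidates = rng.sample(range(n_features), k)
    return max(candidates, key=lambda j: score(node_data, j))

# toy split score: how strongly feature j's values align with the +/-1 labels
def score(rows, j):
    return abs(sum(y * x[j] for x, y in rows))

rng = random.Random(0)
rows = [([1, 5, -1], 1), ([2, -4, -2], 1), ([-1, 3, 1], -1), ([-2, -2, 2], -1)]
j = choose_split(rows, n_features=3, k=2, score=score, rng=rng)
```

Scoring k features instead of all n cuts the per-node cost by a factor of roughly n/k, and the injected randomness also decorrelates the trees.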
13
Avoid Overfitting
Overfitting occurs when the training error is very small whereas the generalization error becomes very large.

How to avoid overfitting when growing a random forest?

• Stop growing statistically insignificant branches.
• Prune the decision tree using some validation set of examples.

It is shown in L. Breiman (2001) that the generalization error converges for a properly grown random forest; therefore, the problem of overfitting is avoided.
14
Possible Final Project
• How to construct trees with multi-way splits, instead of only binary splits? See Wei-Yin Loh (2001) for more discussion.
• How to merge a random forest into a tree efficiently? If a forest has k trees, each with m terminal nodes, a straightforward way can generate a tree of size O(m^k). Is there a way to reduce it to O(m × k)?
• How to use missing information in the training set?
• How to reduce the effect of noise variables and noise examples?
• How to determine a good stopping criterion?
• How to use the same techniques on other classification methods, such as neural networks, support vector machines, etc.?
15
Major Contributors
• Leo Breiman: UC Berkeley.
• Jerry Friedman, Trevor Hastie, and Robert Tibshirani: Stanford University.
• Robert Schapire and Yoav Freund: AT&T Labs.
• Ross Quinlan: the University of New South Wales.

Sites for software download:

• http://stat-www.berkeley.edu/users/breiman/
• http://www-stat.stanford.edu/~jhf/
• http://www.research.att.com/~schapire/
• http://www.cse.unsw.edu.au/~quinlan/
16