Lecture6 - C4.5


Transcript of Lecture6 - C4.5

Page 1: Lecture6 - C4.5

Introduction to Machine Learning

Lecture 6

Albert Orriols i Puig – aorriols@salle.url.edu

Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle

Universitat Ramon Llull

Page 2: Lecture6 - C4.5

Recap of Lecture 4

ID3 is a strong system that:

Uses hill-climbing search based on the information gain measure to search through the space of decision trees.

Outputs a single hypothesis.

Never backtracks: it converges to locally optimal solutions.

Uses all training examples at each step, contrary to methods that make decisions incrementally.

Uses statistical properties of all examples: the search is less sensitive to errors in individual training examples.

Can handle noisy data by modifying its termination criterion to accept hypotheses that imperfectly fit the data.


Page 3: Lecture6 - C4.5

Recap of Lecture 4

However, ID3 has some drawbacks:

It can only deal with nominal data.

It is not able to deal with noisy data sets

It may not be robust in the presence of noise.


Page 4: Lecture6 - C4.5

Today’s Agenda

Going from ID3 to C4.5

How C4.5 enhances ID3 to:

Be robust in the presence of noise and avoid overfitting

Deal with continuous attributes

Deal with missing data

Convert trees to rules


Page 5: Lecture6 - C4.5

What’s Overfitting?

Overfitting: given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h’ ∈ H such that

1. h has smaller error than h’ over the training examples, but
2. h’ has smaller error than h over the entire distribution of instances.


Page 6: Lecture6 - C4.5

Why May my System Overfit?

In domains with noise or uncertainty, the system may try to decrease the training error by completely fitting all the training examples.

The learner overfits to correctly classify the noisy instances.

Occam’s razor: prefer the simplest hypothesis that fits the data with high accuracy.


Page 7: Lecture6 - C4.5

How to Avoid Overfitting?

OK, my system may overfit… Can I avoid it?

Sure! Do not include branches that fit the data too specifically.

How?

1. Pre-prune: stop growing a branch when information becomes unreliable.

2. Post-prune: take a fully-grown decision tree and discard unreliable parts.


Page 8: Lecture6 - C4.5

Pre-pruning

Based on a statistical significance test.

Stop growing the tree when there is no statistically significant association between any attribute and the class at a particular node.

Use all available data for training and apply the statistical test to estimate whether expanding/pruning a node will produce an improvement beyond the training set.

Most popular test: chi-squared test

ID3 used the chi-squared test in addition to information gain: only statistically significant attributes were allowed to be selected by the information gain procedure. A small sketch of such a test follows.
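As an illustration, here is a minimal Python sketch of such a significance test; the helper function and the contingency-table layout are my own assumptions, not ID3's or C4.5's actual code.

```python
def chi_squared_statistic(table):
    """table[i][j] = number of instances with attribute value i and class j."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of attribute and class
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

# Windy vs. Play from the weather data (counts assumed for illustration):
# windy=False -> 6 yes / 2 no, windy=True -> 3 yes / 3 no.
stat = chi_squared_statistic([[6, 2], [3, 3]])
print(round(stat, 3))   # 0.933, well below the 5%-level critical value of
                        # 3.84 for 1 degree of freedom, so do not split here
```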


Page 9: Lecture6 - C4.5

Pre-pruning

Early stopping: pre-pruning may stop the growth process prematurely.

Classic example: XOR/Parity-problem

No individual attribute exhibits any significant association with the class.

Structure is only visible in fully expanded tree

Pre-pruning won’t expand the root node.

But: XOR-type problems are rare in practice.

And: pre-pruning is faster than post-pruning.

    x1   x2   Class
1   0    0    0
2   0    1    1
3   1    0    1
4   1    1    0
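To make this concrete, here is a small sketch (my own code, not from the slides) showing that neither attribute of the XOR data above has any information gain at the root:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

# The XOR table: ((x1, x2), class)
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

for attr in (0, 1):
    # Partition the instances by the value of this attribute
    branches = {}
    for x, y in data:
        branches.setdefault(x[attr], []).append(y)
    remainder = sum(len(ys) / len(data) * entropy([ys.count(0), ys.count(1)])
                    for ys in branches.values())
    gain = entropy([2, 2]) - remainder
    print(f"x{attr + 1}: gain = {gain}")   # 0.0 for both attributes
```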


Page 10: Lecture6 - C4.5

Post-pruning

First, build the full tree.

Then, prune it: the fully-grown tree shows all attribute interactions.

Problem: some subtrees might be due to chance effects

Two pruning operations:
1. Subtree replacement
2. Subtree raising

Possible strategies:
error estimation
significance testing
MDL principle


Page 11: Lecture6 - C4.5

Subtree Replacement

Bottom-up approach:

Consider replacing a tree only after considering all its subtrees.

Example: labor negotiations.


Page 12: Lecture6 - C4.5

Subtree Replacement

Algorithm:
1. Split the data into a training set and a validation set.
2. Do until further pruning is harmful:
   a. Evaluate the impact on the validation set of pruning each possible node.
   b. Select the node whose removal most increases the validation set accuracy.
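A minimal sketch of this loop, assuming a hypothetical Node class in which every node also stores the majority class of its training instances (so any internal node can be collapsed into a leaf); this illustrates the algorithm above, not C4.5's actual implementation:

```python
class Node:
    def __init__(self, attr=None, children=None, label=None):
        self.attr = attr                 # attribute tested here (None = leaf)
        self.children = children or {}   # attribute value -> child Node
        self.label = label               # majority class at this node

    def predict(self, x):
        if self.attr is None or x.get(self.attr) not in self.children:
            return self.label
        return self.children[x[self.attr]].predict(x)

def internal_nodes(node):
    if node.attr is not None:
        yield node
        for child in node.children.values():
            yield from internal_nodes(child)

def accuracy(tree, data):
    return sum(tree.predict(x) == y for x, y in data) / len(data)

def reduced_error_prune(tree, validation):
    """Step 2: keep replacing the subtree whose removal most increases
    validation accuracy, until no replacement helps."""
    while True:
        best, best_acc = None, accuracy(tree, validation)
        for node in list(internal_nodes(tree)):
            saved = (node.attr, node.children)
            node.attr, node.children = None, {}    # tentatively make a leaf
            acc = accuracy(tree, validation)
            if acc > best_acc:
                best, best_acc = node, acc
            node.attr, node.children = saved       # undo the tentative prune
        if best is None:
            return tree
        best.attr, best.children = None, {}        # commit the best prune
```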


Page 13: Lecture6 - C4.5

Subtree Raising

Delete a node and redistribute its instances.
Slower than subtree replacement (worthwhile?).



Page 14: Lecture6 - C4.5

Estimating Error Rates

OK, we can prune. But when?

Prune only if it reduces the estimated error

Error on the training data is NOT a useful estimator. Q: Why would it result in very little pruning? (The fully-grown tree was built to fit the training data, so almost no pruning would appear to help.)

Use a hold-out set for pruning:
Separate a validation set and use it to test the improvement.
(Figure: the data set is split into a reduced training set and a validation set.)

C4.5’s method:
Derive a confidence interval from the training data.
Use a heuristic limit, derived from this, for pruning.
Standard Bernoulli-process-based method.
Shaky statistical assumptions (based on training data). A sketch of the estimate follows.
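Following the standard textbook account (Witten & Frank) of this Bernoulli-based estimate, here is a sketch with z = 0.69 for the default 25% confidence level; Quinlan's actual implementation may differ in detail:

```python
from math import sqrt

def pessimistic_error(f, n, z=0.69):
    """Upper limit of a confidence interval for the true error rate,
    given observed error rate f on n instances (z = 0.69 ~ 25% level)."""
    return ((f + z * z / (2 * n)
             + z * sqrt(f / n - f * f / n + z * z / (4 * n * n)))
            / (1 + z * z / n))

# A leaf covering 6 instances with 2 errors: the observed rate 0.33 is
# inflated to a pessimistic estimate of about 0.47. A subtree is pruned
# when its combined estimate is no better than that of its parent-as-leaf.
print(round(pessimistic_error(2 / 6, 6), 3))   # 0.474
```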


Page 15: Lecture6 - C4.5

Deal with continuous attributes

When dealing with nominal data, we evaluated the gain for each possible value.

In continuous data, we have infinitely many possible values.

What should we do?

Continuous-valued attributes may take infinitely many values, but we have a limited number of values in our instances (at most N if we have N instances).

Therefore, simulate that you have N nominal values:
Evaluate the information gain for every possible split point of the attribute.
Choose the best split point.
The information gain of the attribute is the information gain of the best split.


Page 16: Lecture6 - C4.5

Deal with continuous attributes

Example:

Outlook    Temperature   Humidity   Windy   Play
Sunny      85            85         False   No
Sunny      80            90         True    No
Overcast   83            86         False   Yes
Rainy      75            80         False   Yes
…          …             …          …       …

(Temperature and Humidity are the continuous attributes.)


Page 17: Lecture6 - C4.5

Deal with continuous attributes

Split on the temperature attribute:

value:  64  65  68  69  70  71  72  72  75  75  80  81  83  85
class:  Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

E.g., temperature < 71.5: 4 yes, 2 no
      temperature ≥ 71.5: 5 yes, 3 no

info([4,2], [5,3]) = 6/14 · info([4,2]) + 8/14 · info([5,3]) = 0.939 bits

Place split points halfway between values.
Can evaluate all split points in one pass!
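The slide's number can be reproduced in a few lines of Python (the helper names are my own):

```python
from math import log2

def info(counts):
    """Entropy of a class distribution, in bits."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def split_info(partitions):
    """Weighted average entropy of the partitions induced by a split."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * info(p) for p in partitions)

print(round(split_info([[4, 2], [5, 3]]), 3))   # 0.939 bits
```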


Page 18: Lecture6 - C4.5

Deal with continuous attributes

To speed up: entropy only needs to be evaluated between points of different classes.

value:  64  65  68  69  70  71  72  72  75  75  80  81  83  85
class:  Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

Potential optimal breakpoints lie only between values of different classes: breakpoints between values of the same class cannot be optimal.
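A sketch of this speed-up on the data above (my own code): a midpoint is kept only when the two adjacent distinct values are not both pure runs of the same class.

```python
from itertools import groupby

values = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ['Y', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'N']

# Group by distinct value; each group's set of classes tells us if it is pure.
groups = [(v, {lab for _, lab in g})
          for v, g in groupby(zip(values, labels), key=lambda p: p[0])]

# Keep a midpoint unless both neighbouring value groups are pure and agree.
candidates = [(a + b) / 2
              for (a, ca), (b, cb) in zip(groups, groups[1:])
              if not (len(ca) == 1 and ca == cb)]
print(candidates)   # [64.5, 66.5, 70.5, 71.5, 73.5, 77.5, 80.5, 84.0]
```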


Page 19: Lecture6 - C4.5

Deal with Missing Data

Missing values are denoted “?” in C4.X.

Simple idea: treat missing as a separate value.

Q: When is this not appropriate?

A: When values are missing due to different reasons.
Example 1: gene expression could be missing when it is very high or very low.
Example 2: the field IsPregnant=missing for a male patient should be treated differently (no) than for a female patient of age 25 (unknown).


Page 20: Lecture6 - C4.5

Deal with Missing Data

Split instances with missing values into pieces.

A piece going down a branch receives a weight proportional to the popularity of the branch.

weights sum to 1

Information gain works with fractional instances: use sums of weights instead of counts.

During classification, split the instance into pieces in the same way.

Merge the class probability distributions using the weights (a sketch follows).
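A minimal sketch of classification with fractional instances; the node layout and names are my own assumptions, not C4.5's data structures:

```python
def classify(node, x, weight=1.0):
    """Return {class: weight} for instance x (a dict of attribute values)."""
    if node['leaf'] is not None:                   # leaf: all weight to its class
        return {node['leaf']: weight}
    value = x.get(node['attr'])
    dist = {}
    if value is None:                              # missing: split into pieces
        total = sum(node['counts'].values())
        for v, child in node['children'].items():
            frac = node['counts'][v] / total       # popularity of the branch
            for c, w in classify(child, x, weight * frac).items():
                dist[c] = dist.get(c, 0.0) + w     # merge weighted distributions
    else:
        dist = classify(node['children'][value], x, weight)
    return dist

# Toy tree on 'windy': 8 training instances went down False, 6 down True.
tree = {'leaf': None, 'attr': 'windy',
        'counts': {'False': 8, 'True': 6},
        'children': {'False': {'leaf': 'yes'}, 'True': {'leaf': 'no'}}}
print(classify(tree, {}))   # {'yes': 8/14, 'no': 6/14}
```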


Page 21: Lecture6 - C4.5

From Trees to Rules

I finally got a tree from domains with:

Noisy instances

Missing values

Continuous attributes

But I prefer rules… (rules are not context dependent).

Procedure: generate a rule for each leaf of the tree (one rule per root-to-leaf path).

Get context-independent rules (see the sketch below).
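A sketch of reading one rule off each root-to-leaf path (hypothetical node layout; the toy tree reuses the weather attributes from the earlier example):

```python
def tree_to_rules(node, conditions=()):
    """Yield (conditions, class) pairs, one per root-to-leaf path."""
    if node['leaf'] is not None:
        yield (conditions, node['leaf'])
    else:
        for value, child in node['children'].items():
            yield from tree_to_rules(child,
                                     conditions + ((node['attr'], value),))

tree = {'leaf': None, 'attr': 'outlook',
        'children': {'sunny': {'leaf': 'no'},
                     'overcast': {'leaf': 'yes'},
                     'rainy': {'leaf': 'yes'}}}

for conds, label in tree_to_rules(tree):
    body = ' AND '.join(f'{a} = {v}' for a, v in conds)
    print(f'IF {body} THEN play = {label}')
```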


Page 22: Lecture6 - C4.5

From Trees to Rules

A somewhat more sophisticated procedure: C4.5Rules.

C4.5Rules greedily prunes conditions from each rule if this reduces its estimated error.

This can produce duplicate rules; check for them at the end.

Then:
look at each class in turn,
consider the rules for that class,
and find a “good” subset (guided by MDL).

Then rank the subsets to avoid conflicts

Finally, remove rules (greedily) if this decreases error on the training data


Page 23: Lecture6 - C4.5

Next Class

Instance-based Classifiers


Page 24: Lecture6 - C4.5

Introduction to Machine Learning

Lecture 6

Albert Orriols i Puig – aorriols@salle.url.edu

Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle

Universitat Ramon Llull