Machine Learning
Learning Decision Trees
Some slides from Tom Mitchell, Dan Roth and others
This lecture: Learning Decision Trees
1. Representation: What are decision trees?
2. Algorithm: Learning decision trees – the ID3 algorithm, a greedy heuristic
3. Some extensions
History of Decision Tree Research
• Full-search decision tree methods to model human concept learning: Hunt et al., 1960s, in psychology
• Quinlan developed the ID3 (Iterative Dichotomiser 3) algorithm, with the information gain heuristic, to learn expert systems from examples (late 70s)
• Breiman, Friedman and colleagues in statistics developed CART (Classification And Regression Trees)
• A variety of improvements in the 80s: coping with noise, continuous attributes, missing data, non-axis-parallel splits, etc.
• Quinlan's updated algorithms, C4.5 (1993) and C5, are more commonly used
• Boosting (or bagging) over decision trees is a very good general-purpose algorithm
Will I play tennis today?
• Features
  – Outlook: {Sunny, Overcast, Rainy}
  – Temperature: {Hot, Mild, Cool}
  – Humidity: {High, Normal, Low}
  – Wind: {Strong, Weak}
• Labels
  – Binary classification task: Y = {+, -}
Will I play tennis today?
Outlook: Sunny, Overcast, Rainy
Temperature: Hot, Mild, Cool
Humidity: High, Normal, Low
Wind: Strong, Weak

   O  T  H  W  Play?
1  S  H  H  W  -
2  S  H  H  S  -
3  O  H  H  W  +
4  R  M  H  W  +
5  R  C  N  W  +
6  R  C  N  S  -
7  O  C  N  S  +
8  S  M  H  W  -
9  S  C  N  W  +
10 R  M  N  W  +
11 S  M  N  S  +
12 O  M  H  S  +
13 O  H  N  W  +
14 R  M  H  S  -
Basic Decision Tree Learning Algorithm
• Data is processed in batch (i.e., all the data is available)
• Recursively build a decision tree top-down.
The decision tree for the data:

Outlook?
├─ Sunny → Humidity?
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind?
    ├─ Strong → No
    └─ Weak → Yes
1. Decide what attribute goes at the top
2. Decide what to do for each value the root attribute takes
Basic Decision Tree Algorithm: ID3

Input: S, the set of examples; Attributes, the set of measured attributes

ID3(S, Attributes):
1. If all examples have the same label: return a single-node tree with that label
2. Otherwise:
   1. Create a Root node for the tree
   2. A = the attribute in Attributes that best classifies S
      (this decides what attribute goes at the top)
   3. For each possible value v that A can take:
      (this decides what to do for each value the root attribute takes)
      1. Add a new tree branch corresponding to A = v
      2. Let S_v be the subset of examples in S with A = v
      3. If S_v is empty:
         add a leaf node with the most common value of Label in S
         (why? for generalization at test time)
         Else: below this branch add the subtree ID3(S_v, Attributes - {A})
         (a recursive call to the ID3 algorithm with all the remaining attributes)
   4. Return the Root node
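To make the recursion concrete, here is a minimal Python sketch of ID3. It is not the lecture's code: the dict-based example representation, the `domains` argument, and the helpers `entropy` and `information_gain` (which implement the gain heuristic defined later in this lecture) are assumptions made for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    # H = -sum_k p_k log2 p_k over the label proportions in `labels`.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    # Expected reduction in label entropy from partitioning on `attribute`.
    n = len(examples)
    before = entropy([ex[target] for ex in examples])
    expected = 0.0
    for v in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == v]
        expected += len(subset) / n * entropy(subset)
    return before - expected

def id3(examples, attributes, domains, target):
    # A minimal illustrative sketch of ID3, not the lecture's code.
    labels = [ex[target] for ex in examples]
    # 1. All examples share a label: return a single-node tree with it.
    if len(set(labels)) == 1:
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    # No attributes left to split on: fall back to the most common label.
    if not attributes:
        return majority
    # 2.2 Pick the attribute that best classifies S (highest information gain).
    A = max(attributes, key=lambda a: information_gain(examples, a, target))
    root = {A: {}}
    # 2.3 One branch per value A can take (its whole domain, not just values in S).
    for v in domains[A]:
        S_v = [ex for ex in examples if ex[A] == v]
        if not S_v:
            # Empty subset: leaf with the most common label in S,
            # so unseen attribute values still get a prediction at test time.
            root[A][v] = majority
        else:
            root[A][v] = id3(S_v, attributes - {A}, domains, target)
    return root
```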
Picking the Root Attribute
• Goal: have the resulting decision tree be as small as possible (Occam's Razor)
  – But finding the minimal decision tree consistent with the data is NP-hard
• The recursive algorithm is a greedy heuristic search for a simple tree, but it cannot guarantee optimality
• The main decision in the algorithm is the selection of the next attribute to split on
Picking the Root Attribute
Consider data with two Boolean attributes (A, B):

<(A=0, B=0), ->: 50 examples
<(A=0, B=1), ->: 50 examples
<(A=1, B=0), ->: 0 examples
<(A=1, B=1), +>: 100 examples

What should be the first attribute we select?

Splitting on A: we get purely labeled nodes.
A?
├─ 0 → - (100 examples)
└─ 1 → + (100 examples)

Splitting on B: we don't get purely labeled nodes.
B?
├─ 0 → - (50 examples)
└─ 1 → A?
    ├─ 0 → - (50 examples)
    └─ 1 → + (100 examples)

What if we instead have <(A=1, B=0), ->: 3 examples?

<(A=0, B=0), ->: 50 examples
<(A=0, B=1), ->: 50 examples
<(A=1, B=0), ->: 3 examples
<(A=1, B=1), +>: 100 examples

Which attribute should we choose? The trees look structurally similar!

Splitting on A first:
A?
├─ 0 → - (100 examples)
└─ 1 → B?
    ├─ 0 → - (3 examples)
    └─ 1 → + (100 examples)

Splitting on B first:
B?
├─ 0 → - (53 examples)
└─ 1 → A?
    ├─ 0 → - (50 examples)
    └─ 1 → + (100 examples)

Advantage A. But… we need a way to quantify things.
Picking the Root Attribute
Goal: have the resulting decision tree be as small as possible (Occam's Razor)
• The main decision in the algorithm is the selection of the next attribute for splitting the data
• We want attributes that split the examples into sets that are relatively pure in one label
  – This way we are closer to a leaf node.
• The most popular heuristic is information gain, which originated with the ID3 system of Quinlan
Reminder: Entropy
The entropy (impurity, disorder) of a set of examples S with respect to binary classification is

$\text{Entropy}(S) = H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$

• The proportion of positive examples is $p_+$
• The proportion of negative examples is $p_-$

In general, for a discrete probability distribution with K possible values, with probabilities $\{p_1, p_2, \cdots, p_K\}$, the entropy is given by

$H(p_1, p_2, \cdots, p_K) = -\sum_{k=1}^{K} p_k \log_2 p_k$

• If all examples belong to the same category, then entropy = 0
• If $p_+ = p_- = \frac{1}{2}$, then entropy = 1

Entropy can be viewed as the number of bits required, on average, to encode information. If the probability of + is 0.5, a single bit is required for each example; if it is 0.8, we can use less than 1 bit.

The uniform distribution has the highest entropy.
[Figure: three example label distributions over {-, +}; the uniform one has the highest entropy]

High entropy: a high level of uncertainty.
Low entropy: low uncertainty.
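As a quick check of these properties, a sketch in the same assumed Python setting as the ID3 code above (`binary_entropy` is an illustrative helper, not lecture code):

```python
from math import log2

def binary_entropy(p_pos):
    # H(S) = -p+ log2 p+ - p- log2 p-, with the convention 0 log 0 = 0.
    return -sum(p * log2(p) for p in (p_pos, 1.0 - p_pos) if p > 0)

print(binary_entropy(1.0))     # 0.0    -> all examples in one category
print(binary_entropy(0.5))     # 1.0    -> uniform: highest entropy, 1 bit each
print(binary_entropy(0.8))     # ~0.722 -> on average, less than 1 bit suffices
print(binary_entropy(9 / 14))  # ~0.940 -> the tennis data, used below
```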
Intuition: choose the attribute that reduces the label entropy the most.
Information Gain
The information gain of an attribute A is the expected reduction in entropy caused by partitioning on this attribute:

$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|}\, \text{Entropy}(S_v)$

where $S_v$ is the subset of examples in which the value of attribute A is set to v.

The entropy of the partitioned data is calculated by weighting the entropy of each partition by its size relative to the original set.
– Partitions of low entropy (imbalanced splits) lead to high gain.

Go back to check which of the A, B splits is better.
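Checking which of the A, B splits is better with the `information_gain` helper from the ID3 sketch above (the dict-based representation is the same assumption as before):

```python
# The second version of the A/B data: 3 negative examples with (A=1, B=0).
data = ( [{"A": 0, "B": 0, "y": "-"}] * 50
       + [{"A": 0, "B": 1, "y": "-"}] * 50
       + [{"A": 1, "B": 0, "y": "-"}] * 3
       + [{"A": 1, "B": 1, "y": "+"}] * 100 )

print(information_gain(data, "A", "y"))  # ~0.90
print(information_gain(data, "B", "y"))  # ~0.32 -> splitting on A wins
```

The quantification matches the earlier intuition: A is the better root, even though the two trees look structurally similar.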
Will I play tennis today?
Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(ild), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)
Will I play tennis today?
Current entropy: p = 9/14, n = 5/14

$H(\text{Play?}) = -(9/14) \log_2(9/14) - (5/14) \log_2(5/14) \approx 0.94$
Information Gain: Outlook
Outlook = Sunny: 5 of 14 examples; p = 2/5, n = 3/5, H_S = 0.971
Outlook = Overcast: 4 of 14 examples; p = 4/4, n = 0, H_O = 0
Outlook = Rainy: 5 of 14 examples; p = 3/5, n = 2/5, H_R = 0.971
Expected entropy: (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.694
Information gain: 0.940 – 0.694 = 0.246
Information Gain: Humidity
Humidity = High: 7 of 14 examples; p = 3/7, n = 4/7, H_H = 0.985
Humidity = Normal: 7 of 14 examples; p = 6/7, n = 1/7, H_N = 0.592
Expected entropy: (7/14) × 0.985 + (7/14) × 0.592 = 0.7885
Information gain: 0.940 – 0.7885 = 0.1515
Which feature to split on?
Information gain:
Outlook: 0.246
Humidity: 0.151
Wind: 0.048
Temperature: 0.029
→ Split on Outlook
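These numbers can be reproduced mechanically. A sketch, reusing `information_gain` from earlier and encoding the data table above (the compact string encoding is an illustration, not lecture code):

```python
# Each row of the tennis table as a 5-character string: O, T, H, W, Play?
rows = ["SHHW-", "SHHS-", "OHHW+", "RMHW+", "RCNW+", "RCNS-", "OCNS+",
        "SMHW-", "SCNW+", "RMNW+", "SMNS+", "OMHS+", "OHNW+", "RMHS-"]
tennis = [dict(zip(["O", "T", "H", "W", "Play?"], r)) for r in rows]

for a in ["O", "H", "W", "T"]:
    print(a, round(information_gain(tennis, a, "Play?"), 3))
# O 0.247, H 0.152, W 0.048, T 0.029 (matching the slide's values up to
# rounding) -> split on Outlook
```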
An Illustrative Example
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
Gain(S, Outlook) = 0.246
Outlook goes at the root.
An Illustrative Example

Outlook
├─ Sunny: examples 1, 2, 8, 9, 11 (2+, 3-) → ?
├─ Overcast: examples 3, 7, 12, 13 (4+, 0-) → Yes
└─ Rainy: examples 4, 5, 6, 10, 14 (3+, 2-) → ?

Continue until:
• every attribute is included in the path, or
• all examples in the leaf have the same label
An Illustrative Example
For the Sunny branch (examples 1, 2, 8, 9, 11):

Day  Outlook  Temperature  Humidity  Wind    PlayTennis
1    Sunny    Hot          High      Weak    No
2    Sunny    Hot          High      Strong  No
8    Sunny    Mild         High      Weak    No
9    Sunny    Cool         Normal    Weak    Yes
11   Sunny    Mild         Normal    Strong  Yes

Gain(S_sunny, Humidity) = .97 - (3/5)·0 - (2/5)·0 = .97
Gain(S_sunny, Temp) = .97 - 0 - (2/5)·1 = .57
Gain(S_sunny, Wind) = .97 - (2/5)·1 - (3/5)·.92 = .02

So the Sunny node splits on Humidity.
An Illustrative Example
The final tree:

Outlook
├─ Sunny (1, 2, 8, 9, 11; 2+, 3-) → Humidity
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast (3, 7, 12, 13; 4+, 0-) → Yes
└─ Rainy (4, 5, 6, 10, 14; 3+, 2-) → Wind
    ├─ Strong → No
    └─ Weak → Yes
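Putting the pieces together: running the `id3` sketch from earlier on the `tennis` list built above should reproduce this tree (the `domains` dict listing each attribute's possible values is an assumption of the sketch):

```python
domains = {"O": ["S", "O", "R"], "T": ["H", "M", "C"],
           "H": ["H", "N", "L"], "W": ["S", "W"]}
tree = id3(tennis, {"O", "T", "H", "W"}, domains, "Play?")
print(tree)
# {'O': {'S': {'H': {'H': '-', 'N': '+', 'L': '-'}}, 'O': '+',
#        'R': {'W': {'S': '-', 'W': '+'}}}}
# Outlook at the root, Humidity under Sunny, Wind under Rainy: the tree above.
# (Humidity = Low never occurs among the Sunny examples, so that branch falls
#  back to the most common label of S_sunny, '-'.)
```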
Hypothesis Space in Decision Tree Induction
• Search over decision trees, which can represent all possible discrete functions (this has pros and cons)
• Goal: to find the best decision tree
• Finding a minimal decision tree consistent with a set of data is NP-hard.
• ID3 performs a greedy heuristic search: hill climbing without backtracking
• Makes statistically based decisions using all the data
Summary: Learning Decision Trees
1. Representation: What are decision trees?
   – A hierarchical data structure that represents data
2. Algorithm: Learning decision trees – the ID3 algorithm, a greedy heuristic
   • If all the examples have the same label, create a leaf with that label
   • Otherwise, find the "most informative" attribute and split the data for the different values of that attribute
   • Recurse on the splits