UVA CS 6316/4501 – Fall 2016
Machine Learning

Lecture 16: Decision Tree / Random Forest / Ensemble

Dr. Yanjun Qi
University of Virginia
Department of Computer Science
Where are we? → Five major sections of this course

• Regression (supervised)
• Classification (supervised)
• Unsupervised models
• Learning theory
• Graphical models
Choosing the right estimator
http://scikit-learn.org/stable/tutorial/machine_learning_map/
Scikit-learn: Regression
• Linear model fitted by minimizing a regularized empirical loss with SGD

Scikit-learn: Classification
• Linear classifiers (SVM, logistic regression, …) with SGD training
• Approximate the explicit feature mappings that correspond to certain kernels
• Combine the predictions of several base estimators built with a given learning algorithm, in order to improve generalizability/robustness over a single estimator: (1) averaging/bagging, (2) boosting (see the sketch below)
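As a minimal sketch of two of the tools named above, an SGD-trained linear classifier and an approximate kernel feature map, in scikit-learn; the toy dataset and parameter values are illustrative assumptions, not from the slides:

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier

# Toy data (an assumption for illustration only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Linear classifier (hinge loss, i.e. a linear SVM) trained with SGD.
linear_clf = SGDClassifier(loss="hinge", random_state=0).fit(X, y)

# Approximate the explicit feature map of an RBF kernel, then fit the
# same kind of linear model in the transformed feature space.
rbf = RBFSampler(gamma=1.0, random_state=0)
X_rbf = rbf.fit_transform(X)
kernel_clf = SGDClassifier(loss="hinge", random_state=0).fit(X_rbf, y)

print(linear_clf.score(X, y), kernel_clf.score(X_rbf, y))
```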
Course roadmap: Basic PCA / Bayes-Net, HMM / K-means + GMM
→ next, after classification?
Today

• Decision Tree (DT):
  – Tree representation
  – Brief information theory
  – Learning decision trees
• Bagging
• Random forests: ensemble of DT
• More about ensemble
A study comparing classifiers
(Proceedings of the 23rd International Conference on Machine Learning, ICML '06)

A study comparing classifiers → 11 binary classification problems / 8 metrics
(figure: results table of the top 8 models)
Where are we? → Three major sections for classification

• We can divide the large variety of classification approaches into roughly three major types:
1. Discriminative: directly estimate a decision rule/boundary, e.g., logistic regression, support vector machine, decision tree
2. Generative: build a generative statistical model, e.g., naïve Bayes classifier, Bayesian networks
3. Instance-based classifiers: use observations directly (no models), e.g., K nearest neighbors
A dataset for classification

• Data / points / instances / examples / samples / records: [rows]
• Features / attributes / dimensions / independent variables / covariates / predictors / regressors: [columns, except the last]
• Target / outcome / response / label / dependent variable: the special column to be predicted [last column]

Output as discrete class label: C1, C2, …, CL
Example

• Example: Play Tennis
(figure: Play Tennis training table, with the class label column C)
Anatomy of a decision tree

Outlook?
├── sunny    → Humidity? (high → No ; normal → Yes)
├── overcast → Yes
└── rain     → Windy?   (true → No ; false → Yes)

• Each node is a test on one feature/attribute
• The branches below a node are the possible values of that attribute
• Leaves are the decisions
• Sample size: your data gets smaller as you move down the tree
Apply model to test data: to 'play tennis' or not

A new test example: (Outlook == rain) and (Windy == false).
Pass it down the tree above → the decision is yes.
Apply model to test data: three cases of "YES"

(Outlook == overcast) → yes
(Outlook == rain) and (Windy == false) → yes
(Outlook == sunny) and (Humidity == normal) → yes

Decision trees
• Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances:
• (Outlook == overcast)
• OR ((Outlook == rain) and (Windy == false))
• OR ((Outlook == sunny) and (Humidity == normal))
• ⇒ yes, play tennis (see the sketch below)
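A minimal sketch of this tree as a hand-coded rule function, checking the three "YES" conjunctions above; the function name and example records are hypothetical:

```python
def play_tennis(outlook, humidity, windy):
    """Walk the play-tennis decision tree and return 'yes' or 'no'."""
    if outlook == "overcast":
        return "yes"
    if outlook == "rain":                 # decision depends on Windy
        return "no" if windy else "yes"
    # outlook == "sunny": decision depends on Humidity
    return "yes" if humidity == "normal" else "no"

# The three "YES" paths from the slide:
assert play_tennis("overcast", "high", True) == "yes"
assert play_tennis("rain", "high", False) == "yes"
assert play_tennis("sunny", "normal", True) == "yes"
```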
Representation

A?
├── 1 → B? (1 → true ; 0 → false)
└── 0 → C? (1 → true ; 0 → false)

Y = ((A and B) or ((not A) and C))
Same concept / different representation

Y = ((A and B) or ((not A) and C))

The same function can also be drawn as a tree rooted at C, e.g.:

C?
├── 0 → B? (0 → false ; 1 → A?: 1 → true, 0 → false)
└── 1 → A? (0 → true ; 1 → B?: 1 → true, 0 → false)

Different tree structures can represent the same concept (verified in the sketch below).
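A quick sketch that verifies the claim by brute force: any tree for this concept, whatever attribute it tests first, must agree with Y = (A and B) or ((not A) and C) on all 8 inputs (the helper names are hypothetical):

```python
from itertools import product

def tree_rooted_at_A(a, b, c):
    # First tree: test A; A = 1 leads to the B subtree, A = 0 to C.
    return b if a else c

def tree_rooted_at_C(a, b, c):
    # Second tree: test C first, then A or B below.
    if not c:
        return a and b          # C = 0 branch: Y reduces to (A and B)
    return (not a) or b         # C = 1 branch: Y reduces to ((not A) or B)

for a, b, c in product([False, True], repeat=3):
    y = (a and b) or ((not a) and c)
    assert tree_rooted_at_A(a, b, c) == y
    assert tree_rooted_at_C(a, b, c) == y
```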
Which attribute to select for splitting?

(figure: a node with 16+ / 16− split into children with 8+ / 8−, then 4+ / 4−, then 2+ / 2−; the counts show the distribution of each class, not of the attribute)

Every child keeps the same 50/50 class mix, so the splits separate nothing: this is bad splitting.
How do we choose which attribute to split?

Which attribute should be tested first? Intuitively, you would prefer the one that separates the training examples as much as possible.
Today

• Decision Tree (DT):
  – Tree representation
  – Brief information theory
  – Learning decision trees
• Bagging
• Random forests: ensemble of DT
• More about ensemble
Information gain is one criterion to decide which attribute to split on

• Imagine:
  – 1. Someone is about to tell you your own name
  – 2. You are about to observe the outcome of a dice roll
  – 3. You are about to observe the outcome of a coin flip
  – 4. You are about to observe the outcome of a biased coin flip
• Each situation has a different amount of uncertainty as to what outcome you will observe.
Information

• Information := reduction in uncertainty (the amount of surprise in the outcome)

  I(E) = log2(1 / p(x)) = −log2 p(x)

• Observing that the outcome of a coin flip is heads:
  I = −log2(1/2) = 1
• Observing that the outcome of a dice roll is 6:
  I = −log2(1/6) ≈ 2.58

If the probability of an event is small and it nevertheless happens, the information is large.
Entropy

• The expected amount of information when observing the output of a random variable X:

  H(X) = E(I(X)) = Σ_i p(x_i) I(x_i) = −Σ_i p(x_i) log2 p(x_i)

• If X can have 8 outcomes and all are equally likely:

  H(X) = −Σ_i (1/8) log2(1/8) = 3
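A minimal sketch of the information and entropy formulas above in Python (assuming log base 2, so the units are bits):

```python
import math

def information(p):
    """Self-information I = -log2 p of an event with probability p."""
    return -math.log2(p)

def entropy(probs):
    """H(X) = -sum_i p_i log2 p_i, with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(information(1 / 2))    # coin flip comes up heads: 1 bit
print(information(1 / 6))    # dice roll shows 6: ~2.58 bits
print(entropy([1 / 8] * 8))  # 8 equally likely outcomes: 3 bits
```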
Entropy

• If there are k possible outcomes: H(X) ≤ log2 k
• Equality holds when all outcomes are equally likely
• The more the probability distribution deviates from uniformity, the lower the entropy

(figure: entropy of a random binary variable as a function of p, maximized at p = 0.5)
Entropy: lower → better purity

• Entropy measures the purity of a node.
• Compare a node with 4+ / 4− to a node with 8+ / 0−: the latter distribution is less uniform, its entropy is lower, and the node is purer.
Information gain

• IG(X, Y) = H(Y) − H(Y|X): the reduction in uncertainty of Y from knowing a feature variable X

Information gain := (information before split) − (information after split)
                  = entropy(parent) − [average entropy(children)]

• H(Y) is fixed, so the lower H(Y|X), the better (the children nodes are purer)
• Equivalently, for IG: the higher, the better
Conditional entropy

H(Y) = −Σ_i p(y_i) log2 p(y_i)

H(Y|X) = Σ_j p(x_j) H(Y | X = x_j)
       = −Σ_j p(x_j) Σ_i p(y_i|x_j) log2 p(y_i|x_j)

H(Y | X = x_j) = −Σ_i p(y_i|x_j) log2 p(y_i|x_j)
Example

X1  X2  Y  Count
T   T   +  2
T   F   +  2
F   T   −  5
F   F   +  1

(X1 and X2 are attributes; Y is the label)

IG(X1, Y) = H(Y) − H(Y|X1)
H(Y) = −(5/10) log2(5/10) − (5/10) log2(5/10) = 1
H(Y|X1) = P(X1=T) H(Y|X1=T) + P(X1=F) H(Y|X1=F)
        = (4/10) · 0 + (6/10) · (−(5/6) log2(5/6) − (1/6) log2(1/6))   (using 0 log 0 = 0)
        ≈ 0.39

Information gain: IG(X1, Y) = 1 − 0.39 = 0.61

Which one do we choose, X1 or X2?
Which one do we choose?

X1  X2  Y  Count
T   T   +  2
T   F   +  2
F   T   −  5
F   F   +  1

Information gain: IG(X1, Y) = 0.61; IG(X2, Y) ≈ 0.40

Pick X1: pick the variable that provides the most information gain about Y.
→ Then recursively choose the next Xi on each branch: one branch receives the X1 = T rows, the other branch the X1 = F rows (a sketch that recomputes both gains follows).
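A short sketch that recomputes both information gains directly from the (X1, X2, Y, Count) table above; the row encoding and helper names are illustrative choices:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Rows of the slide's table: (X1, X2, Y, count).
rows = [("T", "T", "+", 2), ("T", "F", "+", 2),
        ("F", "T", "-", 5), ("F", "F", "+", 1)]
total = sum(c for *_, c in rows)                  # 10 samples

def info_gain(attr):                              # attr: 0 for X1, 1 for X2
    h_y = entropy([5 / total, 5 / total])         # H(Y): 5 "+" vs 5 "-"
    h_y_given_x = 0.0
    for value in ("T", "F"):
        group = [r for r in rows if r[attr] == value]
        n = sum(c for *_, c in group)
        pos = sum(c for _, _, y, c in group if y == "+")
        h_y_given_x += (n / total) * entropy([pos / n, (n - pos) / n])
    return h_y - h_y_given_x

print(round(info_gain(0), 2))   # IG(X1, Y) = 0.61
print(round(info_gain(1), 2))   # IG(X2, Y) ~ 0.40, so X1 wins
```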
Decision Trees

• Caveat: the number of possible values of an attribute influences the information gain. The more possible values, the higher the gain (the more likely it is to form small but pure partitions).

• Other purity (diversity) measures:
  – Information gain
  – Gini (population diversity): Σ_k p_mk (1 − p_mk), where p_mk is the proportion of class k at node m
  – Chi-square test
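A minimal sketch comparing two of the purity measures above on a single node, where the input is the vector of class proportions p_mk:

```python
import math

def cross_entropy(p):
    """Entropy/deviance: -sum_k p_mk log2 p_mk."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

def gini(p):
    """Gini index: sum_k p_mk * (1 - p_mk)."""
    return sum(pk * (1 - pk) for pk in p)

print(cross_entropy([0.5, 0.5]), gini([0.5, 0.5]))  # impure node: 1.0, 0.5
print(cross_entropy([1.0, 0.0]), gini([1.0, 0.0]))  # pure node:   0.0, 0.0
```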
Overfitting

• You can perfectly fit a DT to any training data.
• Instability of trees:
  – High variance (small changes in the training set will result in changes of the tree model)
  – Hierarchical structure → an error in the top split propagates down
• Two approaches (sketched below):
  – 1. Stop growing the tree when further splitting the data does not yield an improvement
  – 2. Grow a full tree, then prune it by eliminating nodes
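A sketch of the two approaches with scikit-learn's DecisionTreeClassifier; the dataset and parameter values are illustrative assumptions (ccp_alpha, the cost-complexity pruning knob, requires scikit-learn >= 0.22):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Approach 1: stop growing early via depth / leaf-size limits.
stopped = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                 random_state=0).fit(X_tr, y_tr)

# Approach 2: grow a full tree, then prune it (cost-complexity pruning).
pruned = DecisionTreeClassifier(ccp_alpha=0.01,
                                random_state=0).fit(X_tr, y_tr)

print(stopped.score(X_te, y_te), pruned.score(X_te, y_te))
```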
From ESL book Ch. 9: Classification and Regression Trees (CART)
• Partition the feature space into a set of rectangles
• Fit a simple model in each partition
Summary: decision trees

• Non-linear classifier
• Easy to use
• Easy to interpret
• Susceptible to overfitting, but this can be avoided
Decision Tree / Random Forest

Task:                 Classification
Representation:       Partition the feature space into a set of rectangles; local smoothness
Score function:       Split purity measure, e.g., IG / cross-entropy / Gini
Search/Optimization:  Greedy search to find partitions
Models, Parameters:   Tree model(s), i.e., the space partition
Today

• Decision Tree (DT):
  – Tree representation
  – Brief information theory
  – Learning decision trees
• Bagging
• Random forests: ensemble of DT
• More about ensemble
Bagging

• Bagging = bootstrap aggregation
• A technique for reducing the variance of an estimated prediction function
• For instance, for classification: a committee of trees
• Each tree casts a vote for the predicted class
Bootstrap

The basic idea: randomly draw datasets with replacement (i.e., allowing duplicates) from the training data, each sample the same size as the original training set.
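A minimal numpy sketch of the bootstrap; the toy "training set" of N = 10 points is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)            # pretend training set of N = 10 points

for b in range(3):
    # Draw N indices with replacement: duplicates are allowed, and each
    # bootstrap sample has the same size as the original training set.
    idx = rng.integers(0, len(data), size=len(data))
    print(f"bootstrap sample {b}: {data[idx]}")
```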
With vs. without replacement

• Bootstrap with replacement can keep the sampling size the same as the original size for every repeated sampling. The sampled data groups are independent of each other.
• Bootstrap without replacement cannot keep the sampling size the same as the original size for every repeated sampling. The sampled data groups are dependent on each other.
Bagging

Given N examples with M features, create B bootstrap samples from the training data.
Bagging of DT classifiers

Fit a decision tree to each bootstrap sample (N examples, M features), then take the majority vote over the trees.

i.e., refit the model to each bootstrap dataset, and then examine the behavior over the B replications (see the sketch below).
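A sketch of bagged decision trees with majority voting, using scikit-learn's BaggingClassifier; the dataset and B = 50 are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# B = 50 trees, each fit to a bootstrap sample of the training data;
# predictions are aggregated by majority vote across the committee.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0).fit(X, y)
print(bag.score(X, y))
```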
Bagging for classification with 0/1 loss

• Classification with 0/1 loss:
  – Bagging a good classifier can make it better.
  – Bagging a bad classifier can make it worse.
  – The bagging effect can be understood in terms of a consensus of independent weak learners and the wisdom of crowds.
Peculiarities

• Model instability is good when bagging:
  – The more variable (unstable) the base model is, the more improvement can potentially be obtained
  – Low-variability methods (e.g., LDA) improve less than high-variability methods (e.g., decision trees)
• Load of redundancy:
  – Most predictors do roughly "the same thing"
Bagging: a simulated example (ESL book, Example 8.7.1)

• N = 30 training samples, two classes, p = 5 features
• Each feature has an N(0, 1) distribution, with pairwise correlation 0.95
• The response Y is generated from the first feature (in ESL's example, Pr(Y = 1 | x1 ≤ 0.5) = 0.2 and Pr(Y = 1 | x1 > 0.5) = 0.8)
• Test sample size of 2000
• Fit classification trees to the training set and to B = 200 bootstrap samples

Notice that the bootstrap trees differ from the original tree: the five features are highly correlated with each other → no clear winner when picking which feature to split on → small changes in the training set result in a different tree.
→ But these trees are actually quite similar for classification.

Two ways to aggregate the B votes:
• Consensus: majority vote
• Probability: average the class distributions at the terminal nodes

(figure: test error vs. number of bootstrap trees B)
→ For B > 30, more trees do not improve the bagging results, since the trees are highly correlated with each other and give similar classifications.
Bagging

• Slightly increases the model space
  – Cannot help where a greater enlargement of the space is needed
• Bagged trees are correlated
  – Use random forests to reduce the correlation between trees
Today

• Decision Tree (DT):
  – Tree representation
  – Brief information theory
  – Learning decision trees
• Bagging
• Random forests: a special ensemble of DT
• More about ensemble
Random forest classifier

• The random forest classifier is an extension of bagging that uses de-correlated trees.
Random forest classifier

Given N examples with M features:
1. Create bootstrap samples from the training data.
2. Create a decision tree from each bootstrap sample; at each node, when choosing the split feature, choose only among m < M randomly drawn features.
3. To classify a new example, take the majority vote over the trees.
Random Forests

1. For each of our B bootstrap samples:
   a. Form a tree in the following manner:
      i. Given p dimensions, pick m of them
      ii. Split only according to these m dimensions (we will NOT consider the other p − m dimensions)
      iii. Repeat steps i & ii for each split (note: we pick a different set of m dimensions for each split on a single tree)

(Pages 598–599 in the ESL book)
Random Forests

• A random forest can be viewed as a refinement of bagging, with a tweak of de-correlating the trees:
  – At each tree split, a random subset of m features out of all p features is drawn to be considered for splitting
• Some guidelines were provided by Breiman, but be careful to choose m based on the specific problem:
  – m = p amounts to bagging
  – m = p/3 or log2(p) for regression
  – m = sqrt(p) for classification
  (see the sketch after this list)
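A sketch of the m-out-of-p rule in scikit-learn's RandomForestClassifier; the dataset is an illustrative assumption:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=16, random_state=0)

# max_features plays the role of m, the number of features considered at
# each split; "sqrt" gives m = sqrt(p), the usual classification choice.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0).fit(X, y)
print(rf.score(X, y))

# Setting max_features=None (i.e., m = p) reduces the forest to bagging.
bagging_like = RandomForestClassifier(n_estimators=200, max_features=None,
                                      random_state=0).fit(X, y)
```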
Why are correlated trees not ideal?

• Random forests try to reduce the correlation between the trees. Why?

• Assume each tree has variance σ².
• If the trees are independently and identically distributed, then the variance of their average is σ²/B.
• If they are merely identically distributed (not independent), with pairwise correlation ρ (a positive value), then the variance of the average is

  ρσ² + ((1 − ρ)/B) σ²

• As B → ∞, the second term → 0.
• Thus, the pairwise correlation always limits how much averaging can reduce the variance.

• How to deal with this?
  – If we reduce m (the number of dimensions we actually consider at each split),
  – then we reduce the pairwise tree correlation,
  – and thus the variance will be reduced (a numeric check follows).
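A numeric sketch of the variance formula ρσ² + ((1 − ρ)/B)σ²: simulate B equicorrelated "tree predictions" and compare the empirical variance of their average against the formula (B, ρ, σ², and the trial count are illustrative assumptions):

```python
import numpy as np

B, rho, sigma2, trials = 50, 0.6, 1.0, 100_000
rng = np.random.default_rng(0)

# Equicorrelated Gaussians: a shared component of variance rho*sigma2 plus
# independent components of variance (1-rho)*sigma2 gives pairwise
# correlation rho between any two of the B "trees".
shared = rng.normal(size=(trials, 1)) * np.sqrt(rho * sigma2)
indiv = rng.normal(size=(trials, B)) * np.sqrt((1 - rho) * sigma2)
avg = (shared + indiv).mean(axis=1)

print(avg.var())                               # empirical variance
print(rho * sigma2 + (1 - rho) * sigma2 / B)   # formula: 0.6 + 0.4/50
```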
Today

• Decision Tree (DT):
  – Tree representation
  – Brief information theory
  – Learning decision trees
• Bagging
• Random forests: ensemble of DT
• More about ensemble
e.g., Ensembles in practice: the Netflix Prize (Oct 2006 – 2009)

• Each rating/sample: <user, movie, date of grade, grade>
• Training set: 100,480,507 ratings; qualifying set: 2,817,131 ratings → winner
• Team "BellKor's Pragmatic Chaos" defeated the team "The Ensemble" by submitting just 20 minutes earlier! → 1 million dollars!
• The ensemble teams were blenders of multiple different methods.
References

• Prof. Tan, Steinbach, Kumar's "Introduction to Data Mining" slides
• Hastie, Trevor, et al. The Elements of Statistical Learning. 2nd ed. New York: Springer, 2009.
• Dr. Oznur Tastan's slides about RF and DT