CPSC 340: Machine Learning and Data Mining
Decision Trees
Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart.
Admin
• Assignment 0 is due Wednesday at 9pm (in 2 days).
• Assignment 1 should be released Wednesday, due a week later.
– If you want to work with a partner, you both must request it BEFORE a1 release.
– Instructions in the Homework Submission Instructions document.
• Important webpages:
– https://www.cs.ubc.ca/getacct/
– https://github.ugrad.cs.ubc.ca/CPSC340-2017W-T2/home
– https://piazza.com/class/j9uk5ecmb7e4ks
• Tutorials and office hours start this week.
– See the course homepage for tutorial topics and office hours schedule.
• Auditing:
– No room for official auditors.
– Unofficial auditors, please do not take seats if others are standing.
Last Time: Data Representation and Exploration
• We discussed object-feature representation:
– Examples: another name we'll use for objects.
• We discussed summary statistics and visualizing data.

Age | Job? | City | Rating | Income
23  | Yes  | Van  | A      | 22,000.00
23  | Yes  | Bur  | BBB    | 21,000.00
22  | No   | Van  | CC     | 0.00
25  | Yes  | Sur  | AAA    | 57,000.00
Motivating Example: Food Allergies
• You frequently start getting an upset stomach.
• You suspect an adult-onset food allergy.
Motivating Example: Food Allergies
• To solve the mystery, you start a food journal:

Egg | Milk | Fish | Wheat | Shellfish | Peanuts | … | Sick?
0   | 0.7  | 0    | 0.3   | 0         | 0       |   | 1
0.3 | 0.7  | 0    | 0.6   | 0         | 0.01    |   | 1
0   | 0    | 0    | 0.8   | 0         | 0       |   | 0
0.3 | 0.7  | 1.2  | 0     | 0.10      | 0.01    |   | 1
0.3 | 0    | 1.2  | 0.3   | 0.10      | 0.01    |   | 1

• But it's hard to find the pattern:
– You can't isolate and only eat one food at a time.
– You may be allergic to more than one food.
– The quantity matters: a small amount may be OK.
– You may be allergic to specific interactions.
Supervised Learning
• We can formulate this as supervised learning:
– Input for an object (day of the week) is a set of features (quantities of food).
– Output is a desired class label (whether or not we got sick).
• Goal of supervised learning:
– Use data to find a model that outputs the right label based on the features.
– The model predicts whether foods will make you sick (even with new combinations).

Features:                                             Label:
Egg | Milk | Fish | Wheat | Shellfish | Peanuts | …   Sick?
0   | 0.7  | 0    | 0.3   | 0         | 0            1
0.3 | 0.7  | 0    | 0.6   | 0         | 0.01         1
0   | 0    | 0    | 0.8   | 0         | 0            0
0.3 | 0.7  | 1.2  | 0     | 0.10      | 0.01         1
0.3 | 0    | 1.2  | 0.3   | 0.10      | 0.01         1
Supervised Learning
• General supervised learning problem:
– Take features of objects and corresponding labels as inputs.
– Find a model that can accurately predict the labels of new objects.
• This is the most successful machine learning technique:
– Spam filtering, optical character recognition, Microsoft Kinect, speech recognition, classifying tumours, etc.
• We'll first focus on categorical labels, which is called "classification".
– The model is called a "classifier".
Naïve Supervised Learning: "Predict Mode"
• A very naïve supervised learning method:
– Count how many times each label occurred in the data (4 "sick" vs. 1 "not sick" in the table below).
– Always predict the most common label, the "mode" ("sick" here).
• This ignores the features, so it is only accurate if there is essentially just 1 label.
• There is no unique "right" way to use the features.
– Today we'll consider a classic way known as decision tree learning.

Egg | Milk | Fish | Wheat | Shellfish | Peanuts | … | Sick?
0   | 0.7  | 0    | 0.3   | 0         | 0       |   | 1
0.3 | 0.7  | 0    | 0.6   | 0         | 0.01    |   | 1
0   | 0    | 0    | 0.8   | 0         | 0       |   | 0
0.3 | 0.7  | 1.2  | 0     | 0.10      | 0.01    |   | 1
0.3 | 0    | 1.2  | 0.3   | 0.10      | 0.01    |   | 1
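A minimal Python sketch of this baseline (NumPy assumed; the function names are illustrative, not the course's official code):

import numpy as np

def fit_predict_mode(y):
    """'Predict mode' baseline: return the most common label in the training labels."""
    labels, counts = np.unique(y, return_counts=True)
    return labels[np.argmax(counts)]

def predict(model, X):
    """Ignore the features entirely: predict the stored mode for every object."""
    return np.full(X.shape[0], model)

y = np.array([1, 1, 0, 1, 1])     # the "Sick?" column above
mode_label = fit_predict_mode(y)  # -> 1 ("sick"), since 4 of the 5 labels are 1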
Decision Trees
• Decision trees are simple programs consisting of:
– A nested sequence of "if-else" decisions based on the features (splitting rules).
– A class label as a return value at the end of each sequence.
• Example decision tree:

if (milk > 0.5) {
    return 'sick'
} else {
    if (egg > 1) return 'sick'
    else return 'not sick'
}

• We can draw sequences of decisions as a tree.
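The same example tree, transcribed directly as a Python function:

def predict(egg, milk):
    """The example decision tree above: nested if-else decisions ending in a label."""
    if milk > 0.5:
        return "sick"
    elif egg > 1:
        return "sick"
    else:
        return "not sick"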
Supervised Learning as Writing a Program
• There are many possible decision trees.
– We're going to search for one that is good at our supervised learning problem.
• So our input is data and the output will be a program.
– This is called "training" the supervised learning model.
– Different than the usual input/output specification for writing a program.
• Supervised learning is useful when you have lots of labeled data BUT:
1. The problem is too complicated to write a program ourselves, or
2. A human expert can't explain why you assign certain labels, or
3. We don't have a human expert for the problem.
Learning a Decision Stump
• We'll start with "decision stumps":
– A simple decision tree with 1 splitting rule based on thresholding 1 feature.
• How do we find the best "rule" (feature, threshold, and leaf labels)?
1. Define a 'score' for the rule.
2. Search for the rule with the best score.
Decision Stump: Accuracy Score
• Most intuitive score: classification accuracy.
– "If we use this rule, how many objects do we label correctly?"
• Computing classification accuracy for (egg > 1):
– Find the most common labels if we use this rule:
• When (egg > 1), we were "sick" both times.
• When (egg <= 1), we were "not sick" three out of four times.
– Compute accuracy:
• Rule (egg > 1) is correct on 5/6 objects.
• Scores of other rules:
– (milk > 0.5) obtains a lower accuracy of 4/6.
– (egg > 0) obtains optimal accuracy of 6/6.
– () obtains "baseline" accuracy of 3/6, as does (egg > 2).

Egg | Milk | Fish | … | Sick?
1   | 0.7  | 0    |   | 1
2   | 0.7  | 0    |   | 1
0   | 0    | 0    |   | 0
0   | 0.7  | 1.2  |   | 0
2   | 0    | 1.2  |   | 1
0   | 0    | 0    |   | 0
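As a sketch, the accuracy of a single threshold rule can be computed like this (NumPy assumed; names are illustrative):

import numpy as np

def stump_accuracy(x, y, threshold):
    """Accuracy of the rule (x > threshold), using the best label in each leaf."""
    above = x > threshold
    correct = 0
    for mask in (above, ~above):
        if mask.any():
            # The most common label on this side of the split gets these objects right:
            labels, counts = np.unique(y[mask], return_counts=True)
            correct += counts.max()
    return correct / len(y)

egg = np.array([1, 2, 0, 0, 2, 0])
sick = np.array([1, 1, 0, 0, 1, 0])
print(stump_accuracy(egg, sick, 1))   # -> 5/6 for the rule (egg > 1), as above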
Decision Stump: Rule Search (Attempt 1)
• The accuracy "score" evaluates the quality of a rule.
– Find the best rule by maximizing the score.
• Attempt 1 (exhaustive search):
– Compute the score of (egg > 0), (egg > 0.01), (egg > 0.02), (egg > 0.03), …, (egg > 99.99).
– Compute the score of (milk > 0), (milk > 0.01), (milk > 0.02), (milk > 0.03), …, (milk > 0.99).
– And so on for every other feature.
• As you go, keep track of the highest score.
• Return the highest-scoring rule (variable, threshold, and leaf values).
Supervised Learning Notation (MEMORIZE THIS)
• Feature matrix 'X' has rows as objects, columns as features.
– x_ij is feature 'j' for object 'i' (quantity of food 'j' on day 'i').
– x_i is the list of all features for object 'i' (all the quantities on day 'i').
– x_j is column 'j' of the matrix (the value of feature 'j' across all objects).
• Label vector 'y' contains the labels of the objects.
– y_i is the label of object 'i' (1 for "sick", 0 for "not sick").

X (features):                                         y:
Egg | Milk | Fish | Wheat | Shellfish | Peanuts       Sick?
0   | 0.7  | 0    | 0.3   | 0         | 0             1
0.3 | 0.7  | 0    | 0.6   | 0         | 0.01          1
0   | 0    | 0    | 0.8   | 0         | 0             0
0.3 | 0.7  | 1.2  | 0     | 0.10      | 0.01          1
0.3 | 0    | 1.2  | 0.3   | 0.10      | 0.01          1
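As a concrete sketch (NumPy assumed; note NumPy indexing is 0-based while the math notation is 1-based):

import numpy as np

# Rows are objects (days); columns are the features Egg, Milk, Fish, Wheat, Shellfish, Peanuts.
X = np.array([[0,   0.7, 0,   0.3, 0,    0   ],
              [0.3, 0.7, 0,   0.6, 0,    0.01],
              [0,   0,   0,   0.8, 0,    0   ],
              [0.3, 0.7, 1.2, 0,   0.10, 0.01],
              [0.3, 0,   1.2, 0.3, 0.10, 0.01]])
y = np.array([1, 1, 0, 1, 1])   # y_i is the label of object i

x_23 = X[1, 2]    # x_ij with i=2, j=3: quantity of fish on day 2
x_2  = X[1, :]    # x_i with i=2: all features for object 2
xj_2 = X[:, 1]    # x_j with j=2: the milk column across all objects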
Supervised Learning Notation (MEMORIZE THIS)
• Training phase:
– Use 'X' and 'y' to find a 'model' (like a decision stump).
• Prediction phase:
– Given an object x_i, use the 'model' to predict a label 'yhat_i' ("sick" or "not sick").
• Training error:
– Fraction of times our prediction 'yhat_i' does not equal the true label y_i.

X (features):                                         y:
Egg | Milk | Fish | Wheat | Shellfish | Peanuts       Sick?
0   | 0.7  | 0    | 0.3   | 0         | 0             1
0.3 | 0.7  | 0    | 0.6   | 0         | 0.01          1
0   | 0    | 0    | 0.8   | 0         | 0             0
0.3 | 0.7  | 1.2  | 0     | 0.10      | 0.01          1
0.3 | 0    | 1.2  | 0.3   | 0.10      | 0.01          1
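In symbols (LaTeX notation), with yhat_i the prediction for object 'i':

\text{training error} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\left[\hat{y}_i \neq y_i\right]

where \mathbb{I}[\cdot] is 1 when its argument is true and 0 otherwise.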
Decision Stump Learning Pseudo-Code
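A minimal Python sketch of decision stump learning (accuracy score, with candidate thresholds restricted to feature values seen in the data, anticipating Attempt 2 below; NumPy assumed, all names illustrative):

import numpy as np

def mode(a):
    """Most common value in an array."""
    values, counts = np.unique(a, return_counts=True)
    return values[np.argmax(counts)]

def fit_stump(X, y):
    """Search for the best rule (feature j, threshold t), scored by accuracy."""
    n, d = X.shape
    best = {"score": -1.0}
    for j in range(d):
        for t in np.unique(X[:, j]):
            above = X[:, j] > t
            # Best (most common) label on each side of the split:
            y_yes = mode(y[above]) if above.any() else mode(y)
            y_no = mode(y[~above]) if (~above).any() else mode(y)
            yhat = np.where(above, y_yes, y_no)
            score = np.mean(yhat == y)
            if score > best["score"]:
                best = {"score": score, "feature": j, "threshold": t,
                        "label_yes": y_yes, "label_no": y_no}
    return best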
Cost of Decision Stumps (Attempt 1)
• How much does this cost?
• Assume we have:
– 'n' objects (days that we measured).
– 'd' features (foods that we measured).
– 'k' thresholds (> 0, > 0.01, > 0.02, …).
• Computing the score of one rule costs O(n):
– We need to go through all 'n' examples.
– See the notes on the webpage for a review of "O(n)" notation.
• To compute scores for d*k rules, the total cost is O(ndk).
– But 'k' might be huge!
• Can we do better?
Speeding Up Rule Search
• We can ignore rules outside the feature ranges:
– E.g., we never have (egg > 50) in our data.
– These rules can never improve accuracy.
– Restrict thresholds to the range of the features.
• Most of the thresholds give the same score.
– If we never have (0.5 < egg < 1) in the data,
• then (egg > 0.6) and (egg > 0.9) have the same score.
– Restrict thresholds to values in the data.
Decision Stump: Rule Search (Attempt 2)
• Attempt 2 (search only over feature values in the data):
– Compute the score of (egg > 0), (egg > 1), (egg > 2), (egg > 3), (egg > 4), …
– Compute the score of (milk > 0.5), (milk > 0.7), (milk > 1), (milk > 1.25), …
– And so on, using only values that actually appear in each column.
• Now there are at most 'n' thresholds for each feature.
• We only consider O(nd) rules instead of O(dk) rules:
– Total cost changes from O(ndk) to O(n²d).
Decision Stump: Rule Search (Attempt 3)
• Do we have to compute the score from scratch?
– Rules (egg > 1) and (egg > 2) make the same decisions, except when (egg == 2).
• We can actually compute the best rule involving 'egg' in O(n log n):
– Sort the examples based on 'egg', and use the sort positions to re-arrange 'y'.
– Go through the sorted values in order, updating the counts of #sick and #not-sick that both satisfy and don't satisfy the rule.
– With these counts, it's easy to compute the classification accuracy (see bonus slides).
• Sorting costs O(n log n) per feature.
• The total cost of updating the counts is O(n) per feature.
• The total cost is reduced from O(n²d) to O(nd log n).
• This is a good runtime:
– O(nd) is the size of the data, so O(nd log n) is the same as looking at the data, up to a log factor.
– We can apply this algorithm to huge datasets.
(pause)
Decision Tree Learning
• Decision stumps have only 1 rule based on only 1 feature.
– Very limited class of models: usually not very accurate for most tasks.
• Decision trees allow sequences of splits based on multiple features.
– Very general class of models: can get very high accuracy.
– However, it's computationally infeasible to find the best decision tree.
• Most common decision tree learning algorithm in practice:
– Greedy recursive splitting (a minimal sketch follows below).
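A minimal recursive sketch, reusing the fit_stump and mode helpers above (illustrative, not the course's official implementation):

import numpy as np

def fit_tree(X, y, max_depth):
    """Greedy recursive splitting: fit a stump, then recurse on each half."""
    if max_depth == 0 or len(np.unique(y)) == 1:
        return {"leaf": mode(y)}                      # pure leaf or depth limit
    stump = fit_stump(X, y)                           # best single split on this data
    above = X[:, stump["feature"]] > stump["threshold"]
    if above.all() or not above.any():                # the split separates nothing
        return {"leaf": mode(y)}
    return {"feature": stump["feature"], "threshold": stump["threshold"],
            "yes": fit_tree(X[above], y[above], max_depth - 1),
            "no": fit_tree(X[~above], y[~above], max_depth - 1)}

def predict_tree(tree, x):
    """Follow the if-else decisions down to a leaf label."""
    while "leaf" not in tree:
        branch = "yes" if x[tree["feature"]] > tree["threshold"] else "no"
        tree = tree[branch]
    return tree["leaf"]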
Example of Greedy Recursive Splitting
• Start with the full dataset:

Egg | Milk | … | Sick?
0   | 0.7  |   | 1
1   | 0.7  |   | 1
0   | 0    |   | 0
1   | 0.6  |   | 1
1   | 0    |   | 0
2   | 0.6  |   | 1
0   | 1    |   | 1
2   | 0    |   | 1
0   | 0.3  |   | 0
1   | 0.6  |   | 0
2   | 0    |   | 1

• Find the decision stump with the best score.
• Split into two smaller datasets based on the stump, here (milk > 0.5):

Leaf with (milk <= 0.5):
Egg | Milk | … | Sick?
0   | 0    |   | 0
1   | 0    |   | 0
2   | 0    |   | 1
0   | 0.3  |   | 0
2   | 0    |   | 1

Leaf with (milk > 0.5):
Egg | Milk | … | Sick?
0   | 0.7  |   | 1
1   | 0.7  |   | 1
1   | 0.6  |   | 1
2   | 0.6  |   | 1
0   | 1    |   | 1
1   | 0.6  |   | 0
Greedy Recursive Splitting
• We now have a decision stump and two datasets:

Leaf with (milk <= 0.5):
Egg | Milk | … | Sick?
0   | 0    |   | 0
1   | 0    |   | 0
2   | 0    |   | 1
0   | 0.3  |   | 0
2   | 0    |   | 1

Leaf with (milk > 0.5):
Egg | Milk | … | Sick?
0   | 0.7  |   | 1
1   | 0.7  |   | 1
1   | 0.6  |   | 1
2   | 0.6  |   | 1
0   | 1    |   | 1
1   | 0.6  |   | 0

• Fit a decision stump to each leaf's data, then add these stumps to the tree.
Greedy Recursive Splitting
• This gives a "depth 2" decision tree, splitting the two datasets into four datasets:

(milk <= 0.5) and (egg <= 1):
Egg | Milk | … | Sick?
0   | 0    |   | 0
1   | 0    |   | 0
0   | 0.3  |   | 0

(milk <= 0.5) and (egg > 1):
Egg | Milk | … | Sick?
2   | 0    |   | 1
2   | 0    |   | 1

(milk > 0.5), first child (split on another feature):
Egg | Milk | … | Sick?
0   | 0.7  |   | 1
1   | 0.7  |   | 1
1   | 0.6  |   | 1
2   | 0.6  |   | 1

(milk > 0.5), second child:
Egg | Milk | … | Sick?
1   | 0.6  |   | 0
Greedy Recursive Splitting
• We could try to split the four leaves to make a "depth 3" decision tree.
• We might continue splitting until:
– The leaves each have only one label.
– We reach a user-defined maximum depth.
Discussion of Decision Tree Learning
• Advantages:
– Interpretable.
– Fast to learn.
– Very fast to classify.
• Disadvantages:
– Hard to find the optimal set of rules.
– Greedy splitting is often not accurate, and requires very deep trees.
• Issues:
– Can you revisit a feature?
• Yes, knowing other information could make a feature relevant again.
– More complicated rules?
• Yes, but searching for the best rule gets much more expensive.
– Is accuracy the best score?
• No, there may be no split that increases accuracy. Alternative: information gain (bonus slides).
– What depth?
Summary
• Supervised learning:
– Using data to write a program based on input/output examples.
• Decision trees: predicting a label using a sequence of simple rules.
• Decision stumps: simple decision trees that are very fast to fit.
• Greedy recursive splitting: uses a sequence of stumps to fit a tree.
– Very fast and interpretable, but not always the most accurate.
Other Considerations for the Food Allergy Example
• What types of preprocessing might we do?
– Data cleaning: check for and fix missing/unreasonable values.
– Summary statistics:
• Can help identify "unclean" data.
• Correlation might reveal an obvious dependence ("sick" ↔ "peanuts").
– Data transformations:
• Convert everything to the same scale? (e.g., grams)
• Add foods from the day before? (maybe "sick" depends on multiple days)
• Add the date? (maybe what makes you "sick" changes over time)
– Data visualization: look at a scatterplot of each feature and the label.
• Maybe the visualization will show something weird in the features.
• Maybe the pattern is really obvious!
• What you do might depend on how much data you have:
– Very little data:
• Represent food by common allergenic ingredients (lactose, gluten, etc.)?
– Lots of data:
• Use more fine-grained features (bread from bakery vs. hamburger bun)?
How do we fit stumps in O(nd log n)?
• Let's say we're trying to find the best rule involving milk:

Egg | Milk | … | Sick?
0   | 0.7  |   | 1
1   | 0.7  |   | 1
0   | 0    |   | 0
1   | 0.6  |   | 1
1   | 0    |   | 0
2   | 0.6  |   | 1
0   | 1    |   | 1
2   | 0    |   | 1
0   | 0.3  |   | 0
1   | 0.6  |   | 0
2   | 0    |   | 1

• First grab the milk column and sort it (using the sort positions to re-arrange the sick column). This step costs O(n log n) due to sorting.
• Now we'll go through the milk values in order, keeping track of the #sick and #not-sick that are above/below the current value. E.g., #sick above 0.3 is 5.
• With these counts, the accuracy score is (sum of the most common label above and below)/n.

Milk (sorted) | Sick?
0             | 0
0             | 0
0             | 0
0             | 0
0.3           | 0
0.6           | 1
0.6           | 1
0.6           | 0
0.7           | 1
0.7           | 1
1             | 1
How do we fit stumps in O(nd log n)?
• Start with the baseline rule (), which is always "satisfied" (using the sorted table above):
– If satisfied: #sick = 5 and #not-sick = 6. If not satisfied: #sick = 0 and #not-sick = 0.
– This gives an accuracy of (6 + 0)/n = 6/11.
• Next try the rule (milk > 0), and update the counts based on the 4 rows with milk = 0:
– If satisfied: #sick = 5 and #not-sick = 2. If not satisfied: #sick = 0 and #not-sick = 4.
– This gives an accuracy of (5 + 4)/n = 9/11, which is better.
• Next try the rule (milk > 0.3), and update the counts based on the 1 row with milk = 0.3:
– If satisfied: #sick = 5 and #not-sick = 1. If not satisfied: #sick = 0 and #not-sick = 5.
– This gives an accuracy of (5 + 5)/n = 10/11, which is better.
• (And keep going until you get to the end…)
How do we fit stumps in O(nd log n)?
• Notice that for each row, updating the counts only costs O(1).
– Since there are O(n) rows, the total cost of updating the counts is O(n).
• Instead of 2 labels (sick vs. not-sick), consider the case of 'k' labels:
– Updating the counts still costs O(n), since each row has one label.
– But computing the 'max' across the labels costs O(k), so the cost is O(kn).
• With 'k' labels, you can decrease the cost using a "max-heap" data structure:
– The cost of getting the max is O(1), and the cost of updating the heap for one row is O(log k).
– But k <= n (each row has only one label).
– So the cost is in O(log n) for one row.
• Since the above shows we can find the best rule in one column in O(n log n), the total cost to find the best rule across all 'd' columns is O(nd log n).
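A minimal Python sketch of this sorted-scan idea for one feature column (two labels, accuracy score; NumPy assumed, names illustrative, data taken from the sorted walkthrough above):

import numpy as np

def best_rule_one_feature(x, y):
    """Best threshold for rules (x > t) in one column, via one sorted pass.

    Sorting costs O(n log n); the scan then updates counts in O(1) per row.
    Labels y must be 0/1. Returns (threshold, accuracy); a threshold of None
    means the baseline rule () that every object satisfies.
    """
    order = np.argsort(x)                    # sort the feature, re-arrange labels
    x, y = x[order], y[order]
    n = len(y)
    above = np.array([np.sum(y == 0), np.sum(y == 1)])   # label counts, rule satisfied
    below = np.array([0, 0])                             # label counts, not satisfied
    best_acc, best_t = above.max() / n, None             # baseline rule ()
    i = 0
    while i < n:
        t = x[i]
        while i < n and x[i] == t:           # move all rows with value t across
            above[y[i]] -= 1
            below[y[i]] += 1
            i += 1
        acc = (above.max() + below.max()) / n
        if acc > best_acc:
            best_acc, best_t = acc, t
    return best_t, best_acc

# The sorted milk/sick columns from the walkthrough above:
milk = np.array([0, 0, 0, 0, 0.3, 0.6, 0.6, 0.6, 0.7, 0.7, 1.0])
sick = np.array([0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
print(best_rule_one_feature(milk, sick))     # -> (0.3, 10/11), as in the walkthrough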
Can decision trees re-visit a feature?
• Yes.
– Knowing (ice cream > 0.3) makes small milk quantities relevant.
Can decision trees have more complicated rules?
• Yes.
• But searching for the best rule can get expensive.
Does being greedy actually hurt?
• Can't you just go deeper to correct greedy decisions?
– Yes, but you need to "re-discover" rules with less data.
• Consider that you are allergic to milk (and drink it often), and also get sick when you (rarely) combine diet coke with mentos.
• A greedy method would first split on milk (it helps accuracy the most).
• A non-greedy method could get a simpler tree (splitting on milk later).
Which score function should a decision tree use?
• Shouldn't we just use the accuracy score?
– For leaves: yes, just maximize accuracy.
– For internal nodes: maybe not.
• There may be no simple rule like (egg > 0.5) that improves accuracy.
• Most common score in practice: information gain.
– Choose the split that decreases the entropy ("randomness") of the labels the most.
– Motivation: try to make the split data "less random" or "more predictable".
• It might then be easier to find high-accuracy rules on the "less random" split data. A sketch of the computation follows below.
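A minimal sketch of entropy and information gain for a candidate split (NumPy assumed; names illustrative):

import numpy as np

def entropy(y):
    """Entropy of a label vector: -sum over labels c of p_c * log2(p_c)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, satisfied):
    """Decrease in label entropy from splitting on a boolean rule."""
    gain = entropy(y)
    for mask in (satisfied, ~satisfied):
        if mask.any():
            gain -= (mask.sum() / len(y)) * entropy(y[mask])
    return gain

sick = np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1])
milk = np.array([0.7, 0.7, 0, 0.6, 0, 0.6, 1, 0, 0.3, 0.6, 0])
print(information_gain(sick, milk > 0.5))   # gain from the candidate split (milk > 0.5)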
Decision Trees with Probabilistic Predictions
• Often, we'll have multiple 'y' values at each leaf node.
• In these cases, we might return probabilities instead of a label.
• E.g., if in the leaf node we have 5 "sick" objects and 1 "not sick":
– Return p(y = "sick" | x_i) = 5/6 and p(y = "not sick" | x_i) = 1/6.
• In general, a natural estimate of the probabilities at the leaf nodes:
– Let 'n_k' be the number of objects that arrive at leaf node 'k'.
– Let 'n_kc' be the number of times (y == c) among the objects at leaf node 'k'.
– The maximum likelihood estimate for this leaf is p(y = c | x_i) = n_kc / n_k.
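As a one-function sketch (NumPy assumed; names illustrative):

import numpy as np

def leaf_probabilities(y_leaf):
    """Maximum likelihood estimate p(y = c) = n_kc / n_k at one leaf."""
    labels, counts = np.unique(y_leaf, return_counts=True)
    return dict(zip(labels.tolist(), (counts / counts.sum()).tolist()))

print(leaf_probabilities(np.array([1, 1, 1, 1, 1, 0])))   # -> {0: 1/6, 1: 5/6}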
Alternative Stopping Rules
• There are more complicated rules for deciding when *not* to split.
• Rules based on minimum sample size:
– Don't split any node where the number of objects is less than some 'm'.
– Don't split any node that creates children with fewer than 'm' objects.
• These types of rules try to make sure that you have enough data to justify decisions.
• Alternately, you can use a validation set (see next lecture):
– Don't split the node if it decreases an approximation of test accuracy.
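A hedged sketch of how a minimum-sample-size rule could modify the earlier fit_tree recursion (reuses the illustrative fit_stump and mode helpers above; the parameter name min_samples is an assumption):

import numpy as np

def fit_tree_pruned(X, y, max_depth, min_samples=5):
    """Greedy recursive splitting with a minimum-sample-size stopping rule."""
    if max_depth == 0 or len(y) < min_samples or len(np.unique(y)) == 1:
        return {"leaf": mode(y)}              # too little data to justify a split
    stump = fit_stump(X, y)
    above = X[:, stump["feature"]] > stump["threshold"]
    # Also refuse splits that create children with fewer than min_samples objects:
    if above.sum() < min_samples or (~above).sum() < min_samples:
        return {"leaf": mode(y)}
    return {"feature": stump["feature"], "threshold": stump["threshold"],
            "yes": fit_tree_pruned(X[above], y[above], max_depth - 1, min_samples),
            "no": fit_tree_pruned(X[~above], y[~above], max_depth - 1, min_samples)}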