CPSC 340: Machine Learning and Data Mining
Transcript of CPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining
Decision Trees
Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart.
![Page 2: CPSC 340: Machine Learning and Data Mining](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169f08e11a7b741a34d0c60/html5/thumbnails/2.jpg)
Admin
• Assignment 0 is due Wednesday at 9pm (in 2 days).
• Assignment 1 should be released Wednesday, due a week later.
  – If you want to work with a partner, you both must request it BEFORE a1 release.
  – Instructions in the Homework Submission Instructions document.
• Important webpages:
  – https://www.cs.ubc.ca/getacct/
  – https://github.ugrad.cs.ubc.ca/CPSC340-2017W-T2/home
  – https://piazza.com/class/j9uk5ecmb7e4ks
• Tutorials and office hours start this week.
  – See course homepage for tutorial topics and office hours schedule.
• Auditing:
  – No room for official auditors.
  – Unofficial auditors, please do not take seats if others are standing.
Last Time: Data Representation and Exploration
• We discussed object-feature representation:
  – Examples: another name we'll use for objects.
• We discussed summary statistics and visualizing data.

| Age | Job? | City | Rating | Income    |
|-----|------|------|--------|-----------|
| 23  | Yes  | Van  | A      | 22,000.00 |
| 23  | Yes  | Bur  | BBB    | 21,000.00 |
| 22  | No   | Van  | CC     | 0.00      |
| 25  | Yes  | Sur  | AAA    | 57,000.00 |
Motivating Example: Food Allergies
• You frequently start getting an upset stomach.
• You suspect an adult-onset food allergy.
Motivating Example: Food Allergies
• To solve the mystery, you start a food journal:

| Egg | Milk | Fish | Wheat | Shellfish | Peanuts | … | Sick? |
|-----|------|------|-------|-----------|---------|---|-------|
| 0   | 0.7  | 0    | 0.3   | 0         | 0       |   | 1     |
| 0.3 | 0.7  | 0    | 0.6   | 0         | 0.01    |   | 1     |
| 0   | 0    | 0    | 0.8   | 0         | 0       |   | 0     |
| 0.3 | 0.7  | 1.2  | 0     | 0.10      | 0.01    |   | 1     |
| 0.3 | 0    | 1.2  | 0.3   | 0.10      | 0.01    |   | 1     |

• But it's hard to find the pattern:
  – You can't isolate and only eat one food at a time.
  – You may be allergic to more than one food.
  – The quantity matters: a small amount may be ok.
  – You may be allergic to specific interactions.
Supervised Learning
• We can formulate this as supervised learning:
• Input for an object (day of the week) is a set of features (quantities of food).
• Output is a desired class label (whether or not we got sick).
• Goal of supervised learning:
  – Use data to find a model that outputs the right label based on the features.
  – Model predicts whether foods will make you sick (even with new combinations).

| Egg | Milk | Fish | Wheat | Shellfish | Peanuts | … | Sick? |
|-----|------|------|-------|-----------|---------|---|-------|
| 0   | 0.7  | 0    | 0.3   | 0         | 0       |   | 1     |
| 0.3 | 0.7  | 0    | 0.6   | 0         | 0.01    |   | 1     |
| 0   | 0    | 0    | 0.8   | 0         | 0       |   | 0     |
| 0.3 | 0.7  | 1.2  | 0     | 0.10      | 0.01    |   | 1     |
| 0.3 | 0    | 1.2  | 0.3   | 0.10      | 0.01    |   | 1     |
Supervised Learning
• General supervised learning problem:
  – Take features of objects and corresponding labels as inputs.
  – Find a model that can accurately predict the labels of new objects.
• This is the most successful machine learning technique:
  – Spam filtering, optical character recognition, Microsoft Kinect, speech recognition, classifying tumours, etc.
• We'll first focus on categorical labels, which is called "classification".
  – The model is called a "classifier".
Naïve Supervised Learning: "Predict Mode"
• A very naïve supervised learning method:
  – Count how many times each label occurred in the data (4 vs. 1 above).
  – Always predict the most common label, the "mode" ("sick" above).
• This ignores the features, so it is only accurate if we only have 1 label.
• There is no unique "right" way to use the features.
  – Today we'll consider a classic way known as decision tree learning.

| Egg | Milk | Fish | Wheat | Shellfish | Peanuts | … | Sick? |
|-----|------|------|-------|-----------|---------|---|-------|
| 0   | 0.7  | 0    | 0.3   | 0         | 0       |   | 1     |
| 0.3 | 0.7  | 0    | 0.6   | 0         | 0.01    |   | 1     |
| 0   | 0    | 0    | 0.8   | 0         | 0       |   | 0     |
| 0.3 | 0.7  | 1.2  | 0     | 0.10      | 0.01    |   | 1     |
| 0.3 | 0    | 1.2  | 0.3   | 0.10      | 0.01    |   | 1     |
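The "predict mode" baseline above can be sketched in a few lines. This is an illustrative sketch (the helper names `fit_mode` and `predict_mode` are mine, not from the slides):

```python
import numpy as np

def fit_mode(y):
    """Training: count each label and remember the most common one (the mode)."""
    labels, counts = np.unique(y, return_counts=True)
    return labels[np.argmax(counts)]

def predict_mode(mode_label, X):
    """Prediction: ignore the features entirely and always return the mode."""
    return np.full(X.shape[0], mode_label)

y = np.array([1, 1, 0, 1, 1])                        # "Sick?" column above (4 vs. 1)
y_hat = predict_mode(fit_mode(y), np.zeros((5, 6)))  # predicts "sick" for every day
```

As the slide notes, this is only a baseline: the features never enter the prediction.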
Decision Trees
• Decision trees are simple programs consisting of:
  – A nested sequence of "if-else" decisions based on the features (splitting rules).
  – A class label as a return value at the end of each sequence.
• Example decision tree:

    if (milk > 0.5) {
      return 'sick'
    } else {
      if (egg > 1)
        return 'sick'
      else
        return 'not sick'
    }

Can draw sequences of decisions as a tree:
Supervised Learning as Writing A Program
• There are many possible decision trees.
  – We're going to search for one that is good at our supervised learning problem.
• So our input is data and the output will be a program.
  – This is called "training" the supervised learning model.
  – Different than the usual input/output specification for writing a program.
• Supervised learning is useful when you have lots of labeled data BUT:
  1. The problem is too complicated to write a program ourselves, or
  2. a human expert can't explain why you assign certain labels, or
  3. we don't have a human expert for the problem.
Learning A Decision Stump
• We'll start with "decision stumps":
  – Simple decision tree with 1 splitting rule based on thresholding 1 feature.
• How do we find the best "rule" (feature, threshold, and leaf labels)?
  1. Define a 'score' for the rule.
  2. Search for the rule with the best score.
Decision Stump: Accuracy Score
• Most intuitive score: classification accuracy.
  – "If we use this rule, how many objects do we label correctly?"
• Computing classification accuracy for (egg > 1):
  – Find most common labels if we use this rule:
    • When (egg > 1), we were "sick" both times.
    • When (egg <= 1), we were "not sick" three out of four times.
  – Compute accuracy:
    • Rule (egg > 1) is correct on 5/6 objects.
• Scores of other rules:
  – (milk > 0.5) obtains lower accuracy of 4/6.
  – (egg > 0) obtains optimal accuracy of 6/6.
  – () obtains "baseline" accuracy of 3/6, as does (egg > 2).

| Egg | Milk | Fish | … | Sick? |
|-----|------|------|---|-------|
| 1   | 0.7  | 0    |   | 1     |
| 2   | 0.7  | 0    |   | 1     |
| 0   | 0    | 0    |   | 0     |
| 0   | 0.7  | 1.2  |   | 0     |
| 2   | 0    | 1.2  |   | 1     |
| 0   | 0    | 0    |   | 0     |
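The accuracy score described above can be computed directly: each side of the split predicts its most common label, and we count how many objects that gets right. A minimal sketch (the helper name `stump_accuracy` is mine):

```python
import numpy as np

def stump_accuracy(x, y, threshold):
    """Classification accuracy of the rule (x > threshold) on one feature column.

    Each side of the split predicts its most common label, so the number of
    correct predictions on a side is the count of its most common label."""
    above = x > threshold
    correct = 0
    for side in (above, ~above):
        if side.any():
            correct += np.bincount(y[side]).max()
    return correct / len(y)

# Data from the slide above:
egg  = np.array([1, 2, 0, 0, 2, 0])
sick = np.array([1, 1, 0, 0, 1, 0])
stump_accuracy(egg, sick, 1)  # (egg > 1): 5/6
stump_accuracy(egg, sick, 0)  # (egg > 0): 6/6
```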
Decision Stump: Rule Search (Attempt 1)
• Accuracy "score" evaluates quality of a rule.
  – Find the best rule by maximizing score.
• Attempt 1 (exhaustive search):

  Compute score of (egg > 0)      Compute score of (milk > 0)     …
  Compute score of (egg > 0.01)   Compute score of (milk > 0.01)  …
  Compute score of (egg > 0.02)   Compute score of (milk > 0.02)  …
  Compute score of (egg > 0.03)   Compute score of (milk > 0.03)  …
  …                               …                               …
  Compute score of (egg > 99.99)  Compute score of (milk > 0.99)  …

• As you go, keep track of the highest score.
• Return the highest-scoring rule (variable, threshold, and leaf values).
Supervised Learning Notation (MEMORIZE THIS)
• Feature matrix 'X' has rows as objects, columns as features.
  – x_ij is feature 'j' for object 'i' (quantity of food 'j' on day 'i').
  – x_i is the list of all features for object 'i' (all the quantities on day 'i').
  – x_j is column 'j' of the matrix (the value of feature 'j' across all objects).
• Label vector 'y' contains the labels of the objects.
  – y_i is the label of object 'i' (1 for "sick", 0 for "not sick").

| Egg | Milk | Fish | Wheat | Shellfish | Peanuts | Sick? |
|-----|------|------|-------|-----------|---------|-------|
| 0   | 0.7  | 0    | 0.3   | 0         | 0       | 1     |
| 0.3 | 0.7  | 0    | 0.6   | 0         | 0.01    | 1     |
| 0   | 0    | 0    | 0.8   | 0         | 0       | 0     |
| 0.3 | 0.7  | 1.2  | 0     | 0.10      | 0.01    | 1     |
| 0.3 | 0    | 1.2  | 0.3   | 0.10      | 0.01    | 1     |
Supervised Learning Notation (MEMORIZE THIS)
• Training phase:
  – Use 'X' and 'y' to find a 'model' (like a decision stump).
• Prediction phase:
  – Given an object x_i, use the 'model' to predict a label 'ŷ_i' ("sick" or "not sick").
• Training error:
  – Fraction of times our prediction 'ŷ_i' does not equal the true y_i label.

| Egg | Milk | Fish | Wheat | Shellfish | Peanuts | Sick? |
|-----|------|------|-------|-----------|---------|-------|
| 0   | 0.7  | 0    | 0.3   | 0         | 0       | 1     |
| 0.3 | 0.7  | 0    | 0.6   | 0         | 0.01    | 1     |
| 0   | 0    | 0    | 0.8   | 0         | 0       | 0     |
| 0.3 | 0.7  | 1.2  | 0     | 0.10      | 0.01    | 1     |
| 0.3 | 0    | 1.2  | 0.3   | 0.10      | 0.01    | 1     |
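The training error defined above is a one-liner. A minimal sketch using the labels from this dataset (the "predict mode" predictions are my example):

```python
import numpy as np

def training_error(y_hat, y):
    """Fraction of objects where the predicted label differs from the true label."""
    return np.mean(y_hat != y)

y = np.array([1, 1, 0, 1, 1])      # true "Sick?" labels above
y_hat = np.array([1, 1, 1, 1, 1])  # e.g., always predicting the mode ("sick")
err = training_error(y_hat, y)     # wrong on 1 of 5 days: 0.2
```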
Decision Stump Learning Pseudo-Code
(The pseudo-code appears as a figure on the slide.)
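The slide's pseudo-code is shown only as a figure, so here is a reconstruction of what decision stump learning might look like: exhaustively score every (feature, threshold) pair, where thresholds are taken from values in the data, and return the best rule. This is my sketch, not the slide's exact pseudo-code:

```python
import numpy as np

def mode(y):
    """Most common label in y."""
    labels, counts = np.unique(y, return_counts=True)
    return labels[np.argmax(counts)]

def fit_stump(X, y):
    """Return the (feature, threshold, leaf labels) with the best accuracy."""
    best_score, best_rule = -1.0, None
    for j in range(X.shape[1]):           # for each feature...
        for t in np.unique(X[:, j]):      # ...and each threshold in the data
            above = X[:, j] > t
            # each side of the split predicts its most common label
            y_yes = mode(y[above]) if above.any() else mode(y)
            y_no = mode(y[~above]) if (~above).any() else mode(y)
            score = np.mean(np.where(above, y_yes, y_no) == y)
            if score > best_score:
                best_score, best_rule = score, (j, t, y_yes, y_no)
    return best_rule

# Data from the accuracy-score slide (egg, milk):
X = np.array([[1, 0.7], [2, 0.7], [0, 0], [0, 0.7], [2, 0], [0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])
fit_stump(X, y)  # finds (egg > 0), which scores 6/6 on this data
```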
Cost of Decision Stumps (Attempt 1)
• How much does this cost?
• Assume we have:
  – 'n' objects (days that we measured).
  – 'd' features (foods that we measured).
  – 'k' thresholds (>0, >0.01, >0.02, …).
• Computing the score of one rule costs O(n):
  – We need to go through all 'n' examples.
  – See notes on webpage for review of "O(n)" notation.
• To compute scores for d*k rules, total cost is O(ndk).
  – But 'k' might be huge.
• Can we do better?
Speeding up Rule Search
• We can ignore rules outside feature ranges:
  – E.g., we never have (egg > 50) in our data.
  – These rules can never improve accuracy.
  – Restrict thresholds to range of features.
• Most of the thresholds give the same score.
  – If we never have (0.5 < egg < 1) in the data,
    • then (egg < 0.6) and (egg < 0.9) have the same score.
  – Restrict thresholds to values in data.
Decision Stump: Rule Search (Attempt 2)
• Attempt 2 (search only over features in data):

  Compute score of (eggs > 0)  Compute score of (milk > 0.5)   …
  Compute score of (eggs > 1)  Compute score of (milk > 0.7)   …
  Compute score of (eggs > 2)  Compute score of (milk > 1)     …
  Compute score of (eggs > 3)  Compute score of (milk > 1.25)  …
  Compute score of (eggs > 4)  …

• Now at most 'n' thresholds for each feature.
• We only consider O(nd) rules instead of O(dk) rules:
  – Total cost changes from O(ndk) to O(n²d).
Decision Stump: Rule Search (Attempt 3)
• Do we have to compute the score from scratch?
  – Rules (egg > 1) and (egg > 2) have the same decisions, except when (egg == 2).
• We can actually compute the best rule involving 'egg' in O(n log n):
  – Sort the examples based on 'egg', and use these positions to re-arrange 'y'.
  – Go through the sorted values in order, updating the counts of #sick and #not-sick that both satisfy and don't satisfy the rule.
  – With these counts, it's easy to compute the classification accuracy (see bonus slide).
• Sorting costs O(n log n) per feature.
• Total cost of updating counts is O(n) per feature.
• Total cost is reduced from O(n²d) to O(nd log n).
• This is a good runtime:
  – O(nd) is the size of the data, so O(nd log n) is the same as looking at the data, up to a log factor.
  – We can apply this algorithm to huge datasets.
(pause)
Decision Tree Learning
• Decision stumps have only 1 rule based on only 1 feature.
  – Very limited class of models: usually not very accurate for most tasks.
• Decision trees allow sequences of splits based on multiple features.
  – Very general class of models: can get very high accuracy.
  – However, it's computationally infeasible to find the best decision tree.
• Most common decision tree learning algorithm in practice:
  – Greedy recursive splitting.
Example of Greedy Recursive Splitting
• Start with the full dataset:

| Egg | Milk | … | Sick? |
|-----|------|---|-------|
| 0   | 0.7  |   | 1     |
| 1   | 0.7  |   | 1     |
| 0   | 0    |   | 0     |
| 1   | 0.6  |   | 1     |
| 1   | 0    |   | 0     |
| 2   | 0.6  |   | 1     |
| 0   | 1    |   | 1     |
| 2   | 0    |   | 1     |
| 0   | 0.3  |   | 0     |
| 1   | 0.6  |   | 0     |
| 2   | 0    |   | 1     |

Find the decision stump with the best score.

Split into two smaller datasets based on the stump:

Rule not satisfied:
| Egg | Milk | … | Sick? |
|-----|------|---|-------|
| 0   | 0    |   | 0     |
| 1   | 0    |   | 0     |
| 2   | 0    |   | 1     |
| 0   | 0.3  |   | 0     |
| 2   | 0    |   | 1     |

Rule satisfied:
| Egg | Milk | … | Sick? |
|-----|------|---|-------|
| 0   | 0.7  |   | 1     |
| 1   | 0.7  |   | 1     |
| 1   | 0.6  |   | 1     |
| 2   | 0.6  |   | 1     |
| 0   | 1    |   | 1     |
| 1   | 0.6  |   | 0     |
Greedy Recursive Splitting
We now have a decision stump and two datasets:

| Egg | Milk | … | Sick? |
|-----|------|---|-------|
| 0   | 0    |   | 0     |
| 1   | 0    |   | 0     |
| 2   | 0    |   | 1     |
| 0   | 0.3  |   | 0     |
| 2   | 0    |   | 1     |

| Egg | Milk | … | Sick? |
|-----|------|---|-------|
| 0   | 0.7  |   | 1     |
| 1   | 0.7  |   | 1     |
| 1   | 0.6  |   | 1     |
| 2   | 0.6  |   | 1     |
| 0   | 1    |   | 1     |
| 1   | 0.6  |   | 0     |

Fit a decision stump to each leaf's data.
Greedy Recursive Splitting
We now have a decision stump and two datasets (same as above). Fit a decision stump to each leaf's data. Then add these stumps to the tree.
Greedy Recursive Splitting
This gives a "depth 2" decision tree. It splits the two datasets into four datasets:

| Egg | Milk | … | Sick? |
|-----|------|---|-------|
| 0   | 0    |   | 0     |
| 1   | 0    |   | 0     |
| 0   | 0.3  |   | 0     |

| Egg | Milk | … | Sick? |
|-----|------|---|-------|
| 2   | 0    |   | 1     |
| 2   | 0    |   | 1     |

| Egg | Milk | … | Sick? |
|-----|------|---|-------|
| 0   | 0.7  |   | 1     |
| 1   | 0.7  |   | 1     |
| 1   | 0.6  |   | 1     |
| 2   | 0.6  |   | 1     |

| Egg | Milk | … | Sick? |
|-----|------|---|-------|
| 1   | 0.6  |   | 0     |
Greedy Recursive Splitting
We could try to split the four leaves to make a "depth 3" decision tree.

We might continue splitting until:
- The leaves each have only one label.
- We reach a user-defined maximum depth.
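The greedy recursive splitting procedure above can be sketched as follows: fit a stump, split the data by its rule, and recurse on each half until the leaves are pure or a maximum depth is reached. This is my illustrative sketch (the dict-based tree and helper names are mine):

```python
import numpy as np

def mode(y):
    labels, counts = np.unique(y, return_counts=True)
    return labels[np.argmax(counts)]

def fit_stump(X, y):
    """Best (feature, threshold) by accuracy, thresholds from values in the data."""
    best = (-1.0, 0, 0.0)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            above = X[:, j] > t
            correct = 0
            for side in (above, ~above):
                if side.any():
                    correct += np.bincount(y[side]).max()
            score = correct / len(y)
            if score > best[0]:
                best = (score, j, t)
    return best[1], best[2]

def fit_tree(X, y, depth):
    """Greedy recursive splitting: stop when the leaf is pure or depth runs out."""
    if depth == 0 or len(np.unique(y)) == 1:
        return {"leaf": mode(y)}
    j, t = fit_stump(X, y)
    above = X[:, j] > t
    if above.all() or not above.any():   # no split actually separates the data
        return {"leaf": mode(y)}
    return {"feature": j, "threshold": t,
            "yes": fit_tree(X[above], y[above], depth - 1),
            "no":  fit_tree(X[~above], y[~above], depth - 1)}

def predict(tree, x):
    while "leaf" not in tree:
        tree = tree["yes"] if x[tree["feature"]] > tree["threshold"] else tree["no"]
    return tree["leaf"]

# Full dataset from the example above (egg, milk):
X = np.array([[0, 0.7], [1, 0.7], [0, 0], [1, 0.6], [1, 0], [2, 0.6],
              [0, 1], [2, 0], [0, 0.3], [1, 0.6], [2, 0]])
y = np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1])
tree = fit_tree(X, y, depth=2)  # root splits on milk, as in the slides
```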
Discussion of Decision Tree Learning
• Advantages:
  – Interpretable.
  – Fast to learn.
  – Very fast to classify.
• Disadvantages:
  – Hard to find optimal set of rules.
  – Greedy splitting often not accurate, requires very deep trees.
• Issues:
  – Can you revisit a feature?
    • Yes, knowing other information could make a feature relevant again.
  – More complicated rules?
    • Yes, but searching for the best rule gets much more expensive.
  – Is accuracy the best score?
    • No, there may be no split that increases accuracy. Alternative: information gain (bonus slides).
  – What depth?
Summary
• Supervised learning:
  – Using data to write a program based on input/output examples.
• Decision trees: predicting a label using a sequence of simple rules.
• Decision stumps: simple decision tree that is very fast to fit.
• Greedy recursive splitting: uses a sequence of stumps to fit a tree.
  – Very fast and interpretable, but not always the most accurate.
Other Considerations for Food Allergy Example
• What types of preprocessing might we do?
  – Data cleaning: check for and fix missing/unreasonable values.
  – Summary statistics:
    • Can help identify "unclean" data.
    • Correlation might reveal an obvious dependence ("sick" ⇔ "peanuts").
  – Data transformations:
    • Convert everything to the same scale? (e.g., grams)
    • Add foods from the day before? (maybe "sick" depends on multiple days)
    • Add date? (maybe what makes you "sick" changes over time)
  – Data visualization: look at a scatterplot of each feature and the label.
    • Maybe the visualization will show something weird in the features.
    • Maybe the pattern is really obvious!
• What you do might depend on how much data you have:
  – Very little data:
    • Represent food by common allergenic ingredients (lactose, gluten, etc.)?
  – Lots of data:
    • Use more fine-grained features (bread from bakery vs. hamburger bun)?
How do we fit stumps in O(nd log n)?
• Let's say we're trying to find the best rule involving milk:

| Egg | Milk | … | Sick? |
|-----|------|---|-------|
| 0   | 0.7  |   | 1     |
| 1   | 0.7  |   | 1     |
| 0   | 0    |   | 0     |
| 1   | 0.6  |   | 1     |
| 1   | 0    |   | 0     |
| 2   | 0.6  |   | 1     |
| 0   | 1    |   | 1     |
| 2   | 0    |   | 1     |
| 0   | 0.3  |   | 0     |
| 1   | 0.6  |   | 0     |
| 2   | 0    |   | 1     |

First grab the milk column and sort it (using the sort positions to re-arrange the sick column). This step costs O(n log n) due to sorting.

| Milk | Sick? |
|------|-------|
| 0    | 0     |
| 0    | 0     |
| 0    | 0     |
| 0    | 0     |
| 0.3  | 0     |
| 0.6  | 1     |
| 0.6  | 1     |
| 0.6  | 0     |
| 0.7  | 1     |
| 0.7  | 1     |
| 1    | 1     |

Now, we'll go through the milk values in order, keeping track of #sick and #not-sick that are above/below the current value. E.g., #sick above 0.3 is 5.

With these counts, the accuracy score is (sum of most common label above and below)/n.
How do we fit stumps in O(nd log n)?

| Milk | Sick? |
|------|-------|
| 0    | 0     |
| 0    | 0     |
| 0    | 0     |
| 0    | 0     |
| 0.3  | 0     |
| 0.6  | 1     |
| 0.6  | 1     |
| 0.6  | 0     |
| 0.7  | 1     |
| 0.7  | 1     |
| 1    | 1     |

Start with the baseline rule (), which is always "satisfied":
  If satisfied, #sick = 5 and #not-sick = 6.
  If not satisfied, #sick = 0 and #not-sick = 0.
  This gives accuracy of (6+0)/n = 6/11.

Next try the rule (milk > 0), and update the counts based on the four (milk = 0) rows:
  If satisfied, #sick = 5 and #not-sick = 2.
  If not satisfied, #sick = 0 and #not-sick = 4.
  This gives accuracy of (5+4)/n = 9/11, which is better.

Next try the rule (milk > 0.3), and update the counts based on the one (milk = 0.3) row:
  If satisfied, #sick = 5 and #not-sick = 1.
  If not satisfied, #sick = 0 and #not-sick = 5.
  This gives accuracy of (5+5)/n = 10/11, which is better.
(And keep going until you get to the end…)
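The sweep above can be sketched in code: sort once, then walk the sorted values, moving each row from the "satisfied" side to the "not satisfied" side in O(1) and rescoring from the counts. A minimal sketch for one feature with binary labels (the helper name `best_rule_one_feature` is mine):

```python
import numpy as np

def best_rule_one_feature(x, y):
    """Best threshold for rules (x > t) in O(n log n): sort, then sweep,
    updating label counts on each side instead of rescoring from scratch."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)
    # label counts on the "satisfied" side; the baseline rule () is always
    # satisfied, so every row starts on that side
    above = np.bincount(y, minlength=2)
    below = np.zeros(2, dtype=int)
    best_score, best_t = above.max() / n, None   # baseline rule ()
    i = 0
    while i < n:
        t = x[i]
        while i < n and x[i] == t:   # move all rows with value t below the split
            below[y[i]] += 1
            above[y[i]] -= 1
            i += 1
        if i < n:                    # the rule (x > t) actually splits the data
            score = (above.max() + below.max()) / n
            if score > best_score:
                best_score, best_t = score, t
    return best_score, best_t

# Sorted milk column and re-arranged sick column from the slide:
milk = np.array([0, 0, 0, 0, 0.3, 0.6, 0.6, 0.6, 0.7, 0.7, 1.0])
sick = np.array([0, 0, 0, 0, 0,   1,   1,   0,   1,   1,   1])
best_rule_one_feature(milk, sick)  # best rule is (milk > 0.3), accuracy 10/11
```

Repeating this for each of the 'd' columns gives the O(nd log n) total from the slides.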
How do we fit stumps in O(nd log n)?
(Same sorted table as above.)

Notice that for each row, updating the counts only costs O(1). Since there are O(n) rows, the total cost of updating the counts is O(n).

Instead of 2 labels (sick vs. not-sick), consider the case of 'k' labels:
- Updating the counts still costs O(n), since each row has one label.
- But computing the 'max' across the labels costs O(k), so the cost is O(kn).

With 'k' labels, you can decrease the cost using a "max-heap" data structure:
- Cost of getting the max is O(1); cost of updating the heap for a row is O(log k).
- But k <= n (each row has only one label).
- So the cost is in O(log n) for one row.

Since the above shows we can find the best rule in one column in O(n log n), the total cost to find the best rule across all 'd' columns is O(nd log n).
Can decision trees re-visit a feature?
• Yes.
Knowing (ice cream > 0.3) makes small milk quantities relevant.
Can decision trees have more complicated rules?
• Yes:
• But searching for the best rule can get expensive.
Does being greedy actually hurt?
• Can't you just go deeper to correct greedy decisions?
  – Yes, but you need to "re-discover" rules with less data.
• Consider that you are allergic to milk (and drink this often), and also get sick when you (rarely) combine Diet Coke with Mentos.
• Greedy method should first split on milk (helps accuracy the most):
Does being greedy actually hurt?
• Greedy method should first split on milk (helps accuracy the most).
• Non-greedy method could get a simpler tree (split on milk later):
Which score function should a decision tree use?
• Shouldn't we just use the accuracy score?
  – For leaves: yes, just maximize accuracy.
  – For internal nodes: maybe not.
    • There may be no simple rule like (egg > 0.5) that improves accuracy.
• Most common score in practice: information gain.
  – Choose the split that decreases entropy ("randomness") of the labels the most.
  – Motivation: try to make the split data "less random" or "more predictable".
    • It might then be easier to find high-accuracy rules on the "less random" split data.
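The slides leave the entropy formula to the bonus material; a standard definition of information gain (my sketch, with `entropy` and `information_gain` as my own helper names) is:

```python
import numpy as np

def entropy(y):
    """Entropy (in bits) of a vector of integer labels."""
    p = np.bincount(y) / len(y)
    p = p[p > 0]                     # 0 * log(0) is treated as 0
    return -np.sum(p * np.log2(p))

def information_gain(y, satisfied):
    """Decrease in label entropy from splitting y by a boolean rule:
    H(y) minus the size-weighted entropies of the two sides."""
    n = len(y)
    n_yes = satisfied.sum()
    gain = entropy(y)
    if n_yes > 0:
        gain -= (n_yes / n) * entropy(y[satisfied])
    if n_yes < n:
        gain -= ((n - n_yes) / n) * entropy(y[~satisfied])
    return gain

# Sorted milk/sick data from the bonus slides:
milk = np.array([0, 0, 0, 0, 0.3, 0.6, 0.6, 0.6, 0.7, 0.7, 1.0])
sick = np.array([0, 0, 0, 0, 0,   1,   1,   0,   1,   1,   1])
information_gain(sick, milk > 0.3)  # large gain: one side becomes pure
```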
Decision Trees with Probabilistic Predictions
• Often, we'll have multiple 'y' values at each leaf node.
• In these cases, we might return probabilities instead of a label.
• E.g., if in the leaf node we have 5 "sick" objects and 1 "not sick":
  – Return p(y = "sick" | x_i) = 5/6 and p(y = "not sick" | x_i) = 1/6.
• In general, a natural estimate of the probabilities at the leaf nodes:
  – Let 'n_k' be the number of objects that arrive at leaf node 'k'.
  – Let 'n_kc' be the number of times (y == c) among the objects at leaf node 'k'.
  – The maximum likelihood estimate for this leaf is p(y = c | x_i) = n_kc / n_k.
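The leaf estimate n_kc / n_k above is just a normalized label count. A minimal sketch (the helper name `leaf_probabilities` is mine):

```python
import numpy as np

def leaf_probabilities(y_leaf, num_classes):
    """Maximum likelihood estimate at a leaf: p(y = c | x_i) = n_kc / n_k."""
    counts = np.bincount(y_leaf, minlength=num_classes)
    return counts / len(y_leaf)

# Leaf with 5 "sick" objects and 1 "not sick", as in the slide:
p = leaf_probabilities(np.array([1, 1, 1, 1, 1, 0]), 2)
# p[0] = p(y = "not sick") = 1/6, p[1] = p(y = "sick") = 5/6
```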
Alternative Stopping Rules
• There are more complicated rules for deciding when *not* to split.
• Rules based on minimum sample size:
  – Don't split any node where the number of objects is less than some 'm'.
  – Don't split any node that creates children with fewer than 'm' objects.
• These types of rules try to make sure that you have enough data to justify decisions.
• Alternately, you can use a validation set (see next lecture):
  – Don't split the node if it decreases an approximation of the test accuracy.