CPSC 340: Machine Learning and Data Mining
Fundamentals of Learning (Fall 2019)
Admin
• Assignment 1 is due Wednesday: you should be almost done.
• Waiting list people: everyone should be in soon?
• Course webpage:
– https://www.cs.ubc.ca/~fwood/CS340/
• Auditors:
– Bring your forms at the end of class Friday, assuming we clear the waitlist.
• Exchange students:
– If you are still having trouble registering, bring your forms Friday.
– Contact us on Piazza about getting registered for Gradescope.
• Midterm confirmed (14/Feb, 6pm–8pm, Wesbrook 100).
Last Time: Supervised Learning Notation
• Feature matrix 'X' has rows as examples, columns as features.
– xij is feature 'j' for example 'i' (quantity of food 'j' on day 'i').
– xi is the list of all features for example 'i' (all the quantities on day 'i').
– xj is column 'j' of the matrix (the value of feature 'j' across all examples).
• Label vector 'y' contains the labels of the examples.
– yi is the label of example 'i' (1 for "sick", 0 for "not sick").

X (features):

| Egg | Milk | Fish | Wheat | Shellfish | Peanuts |
|-----|------|------|-------|-----------|---------|
| 0   | 0.7  | 0    | 0.3   | 0         | 0       |
| 0.3 | 0.7  | 0    | 0.6   | 0         | 0.01    |
| 0   | 0    | 0    | 0.8   | 0         | 0       |
| 0.3 | 0.7  | 1.2  | 0     | 0.10      | 0.01    |
| 0.3 | 0    | 1.2  | 0.3   | 0.10      | 0.01    |

y (labels):

| Sick? |
|-------|
| 1 |
| 1 |
| 0 |
| 1 |
| 1 |
Supervised Learning Application
• We motivated supervised learning by the "food allergy" example.
• But we can use supervised learning for any input:output mapping.
– E-mail spam filtering.
– Optical character recognition on scanners.
– Recognizing faces in pictures.
– Recognizing tumours in medical images.
– Speech recognition on phones.
– Your problem in industry/research?
Motivation: Determine Home City
• We are given data from 248 homes.
• For each home/example, we have these features:
– Elevation.
– Year.
– Bathrooms.
– Bedrooms.
– Price.
– Square feet.
• Goal is to build a program that predicts SF or NY.

This example and images of it come from: http://www.r2d3.us/visual-intro-to-machine-learning-part-1
Plotting Elevation
Simple Decision Stump
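The stump on this slide appears only as an image. As a minimal sketch (not the slide's actual rule), a decision stump thresholds a single feature; the feature choice and the threshold below are placeholders, not the values from the figure:

```python
def decision_stump(example):
    # Depth-1 rule on a single feature. The threshold (50) is a made-up
    # placeholder, not the value from the slide's figure.
    if example["elevation"] > 50:
        return "SF"  # higher-elevation homes predicted as San Francisco
    return "NY"      # everything else predicted as New York

print(decision_stump({"elevation": 73}))  # -> SF
```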
Scatterplot Array
Scatterplot Array
Plotting Elevation and Price/SqFt
Simple Decision Tree Classification
Simple Decision Tree Classification
How does the depth affect accuracy?
This is a good start (>75% accuracy).
How does the depth affect accuracy?
Start splitting the data recursively…
How does the depth affect accuracy?
Accuracy keeps increasing as we add depth.
How does the depth affect accuracy?
Eventually, we can perfectly classify all of our data.
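A quick way to see this trend is to fit trees of increasing depth and print the training accuracy. A sketch assuming scikit-learn; the synthetic X and y below merely stand in for the 248-home dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the homes data so the sketch runs on its own.
rng = np.random.default_rng(0)
X = rng.random((248, 6))                                          # 6 features per home
y = (X[:, 0] + 0.3 * rng.standard_normal(248) > 0.5).astype(int)  # noisy labels

for depth in [1, 2, 5, 10, 20]:
    model = DecisionTreeClassifier(max_depth=depth).fit(X, y)
    # score() on the fitting data is the training accuracy; it approaches 1.
    print(f"depth={depth:2d}  training accuracy={model.score(X, y):.3f}")
```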
Training vs. Testing Error
• With this decision tree, 'training accuracy' is 1.
– It perfectly labels the data we used to make the tree.
• We are now given features for 217 new homes.
• What is the 'testing accuracy' on the new data?
– How does it do on data not used to make the tree?
• Overfitting: lower accuracy on new data.
– Our rules got too specific to our exact training dataset.
– Some of the "deep" splits only use a few examples (bad "coupon collecting").
Supervised Learning Notation
• We are given training data where we know labels:

X =

| Egg | Milk | Fish | Wheat | Shellfish | Peanuts | … |
|-----|------|------|-------|-----------|---------|---|
| 0   | 0.7  | 0    | 0.3   | 0         | 0       |   |
| 0.3 | 0.7  | 0    | 0.6   | 0         | 0.01    |   |
| 0   | 0    | 0    | 0.8   | 0         | 0       |   |
| 0.3 | 0.7  | 1.2  | 0     | 0.10      | 0.01    |   |
| 0.3 | 0    | 1.2  | 0.3   | 0.10      | 0.01    |   |

y =

| Sick? |
|-------|
| 1 |
| 1 |
| 0 |
| 1 |
| 1 |

• But there is also testing data we want to label:

$\tilde{X}$ =

| Egg | Milk | Fish | Wheat | Shellfish | Peanuts | … |
|-----|------|------|-------|-----------|---------|---|
| 0.5 | 0    | 1    | 0.6   | 2         | 1       |   |
| 0   | 0.7  | 0    | 1     | 0         | 0       |   |
| 3   | 1    | 0    | 0.5   | 0         | 0       |   |

$\tilde{y}$ =

| Sick? |
|-------|
| ? |
| ? |
| ? |
Supervised Learning Notation
• Typical supervised learning steps:
1. Build model based on training data X and y (training phase).
2. Model makes predictions $\hat{y}$ on test data $\tilde{X}$ (testing phase).
• Instead of training error, consider test error:
– Are predictions $\hat{y}$ similar to the true unseen labels $\tilde{y}$?
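A minimal sketch of the two phases, assuming scikit-learn; the random arrays stand in for real training and test data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X, y = rng.random((100, 6)), rng.integers(0, 2, 100)          # training data
X_test, y_test = rng.random((30, 6)), rng.integers(0, 2, 30)  # test data

model = DecisionTreeClassifier(max_depth=3).fit(X, y)  # training phase
y_hat = model.predict(X_test)                          # testing phase
print(f"test error: {np.mean(y_hat != y_test):.3f}")   # fraction mislabeled
```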
Goal of Machine Learning
• In machine learning:
– What we care about is the test error!
• Midterm analogy:
– The training error is the practice midterm.
– The test error is the actual midterm.
– Goal: do well on the actual midterm, not the practice one.
• Memorization vs learning:
– Can do well on training data by memorizing it.
– You've only learned if you can do well in new situations.
Golden Rule of Machine Learning
• Even though what we care about is test error:
– THE TEST DATA CANNOT INFLUENCE THE TRAINING PHASE IN ANY WAY.
• We're measuring test error to see how well we do on new data:
– If used during training, it doesn't measure this.
– You can start to overfit if you use it during training.
– Midterm analogy: you are cheating on the test.
Golden Rule of Machine Learning
• Even though what we care about is test error:
– THE TEST DATA CANNOT INFLUENCE THE TRAINING PHASE IN ANY WAY.

http://www.technologyreview.com/view/538111/why-and-how-baidu-cheated-an-artificial-intelligence-test/
Golden Rule of Machine Learning
• Even though what we care about is test error:
– THE TEST DATA CANNOT INFLUENCE THE TRAINING PHASE IN ANY WAY.
• You also shouldn't change the test set to get the result you want.
– http://blogs.sciencemag.org/pipeline/archives/2015/01/14/the_dukepotti_scandal_from_the_inside
– https://www.cbsnews.com/news/deception-at-duke-fraud-in-cancer-care/
Digression: Golden Rule and Hypothesis Testing
• Note the golden rule applies to hypothesis testing in scientific studies.
– Data that you collect can't influence the hypotheses that you test.
• EXTREMELY COMMON and a MAJOR PROBLEM, coming in many forms:
– Collect more data until you coincidentally get the significance level you want.
– Try different ways to measure performance, choose the one that looks best.
– Choose a different type of model/hypothesis after looking at the test data.
• If you want to modify your hypotheses, you need to test on new data.
– Or at least be aware and honest about this issue when reporting results.
Digression: Golden Rule and Hypothesis Testing
• Note the golden rule applies to hypothesis testing in scientific studies.
– Data that you collect can't influence the hypotheses that you test.
• EXTREMELY COMMON and a MAJOR PROBLEM, coming in many forms:
– "Replication crisis in Science".
– "Why Most Published Research Findings are False".
– "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant".
– "HARKing: Hypothesizing After the Results are Known".
– "Hack Your Way To Scientific Glory".
– "Psychology's Replication Crisis Has Made The Field Better" (some solutions).
Is Learning Possible?
• Does training error say anything about test error?
– In general, NO: test data might have nothing to do with training data.
– E.g., an "adversary" takes the training data and flips all labels:

X =

| Egg | Milk | Fish |
|-----|------|------|
| 0   | 0.7  | 0    |
| 0.3 | 0.7  | 1    |
| 0.3 | 0    | 0    |

y =

| Sick? |
|-------|
| 1 |
| 1 |
| 0 |

$\tilde{X}$ =

| Egg | Milk | Fish |
|-----|------|------|
| 0   | 0.7  | 0    |
| 0.3 | 0.7  | 1    |
| 0.3 | 0    | 0    |

$\tilde{y}$ =

| Sick? |
|-------|
| 0 |
| 0 |
| 1 |

• In order to learn, we need assumptions:
– The training and test data need to be related in some way.
– Most common assumption: independent and identically distributed (IID).
IID Assumption
• Training/test data is independent and identically distributed (IID) if:
– All examples come from the same distribution (identically distributed).
– The examples are sampled independently (order doesn't matter).
• Examples in terms of cards:
– Pick a card, put it back in the deck, re-shuffle, repeat.
– Pick a card, put it back in the deck, repeat.
– Pick a card, don't put it back, re-shuffle, repeat.

| Age | Job? | City | Rating | Income    |
|-----|------|------|--------|-----------|
| 23  | Yes  | Van  | A      | 22,000.00 |
| 23  | Yes  | Bur  | BBB    | 21,000.00 |
| 22  | No   | Van  | CC     | 0.00      |
| 25  | Yes  | Sur  | AAA    | 57,000.00 |
IID Assumption and Food Allergy Example
• Is the food allergy data IID?
– Do all the examples come from the same distribution?
– Does the order of the examples matter?
• No!
– Being sick might depend on what you ate yesterday (not independent).
– Your eating habits might change over time (not identically distributed).
• What can we do about this?
– Just ignore that the data isn't IID and hope for the best?
– For each day, maybe add the features from the previous day? (A sketch follows below.)
– Maybe add time as an extra feature?
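A sketch of the "previous day" idea, assuming the features are stored as a NumPy array with one row per day (the numbers are toy values):

```python
import numpy as np

X = np.array([[0.0, 0.7, 0.0],
              [0.3, 0.7, 0.0],
              [0.0, 0.0, 0.0]])  # toy food quantities for 3 days

# Shift the rows down by one so row i holds day i-1's foods (zeros for day 0).
prev_day = np.vstack([np.zeros(X.shape[1]), X[:-1]])
X_augmented = np.hstack([X, prev_day])  # today's foods + yesterday's foods
print(X_augmented.shape)                # (3, 6)
```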
Learning Theory
• Why does the IID assumption make learning possible?
– Patterns in training examples are likely to be the same in test examples.
• The IID assumption is rarely true:
– But it is often a good approximation.
– There are other possible assumptions.
• Also, we're assuming IID across examples but not across features.
• Learning theory explores how training error is related to test error.
• We'll look at a simple example, using this notation:
– Etrain is the error on training data.
– Etest is the error on testing data.
Fundamental Trade-Off
• Start with Etest = Etest, then add and subtract Etrain on the right:

Etest = (Etest − Etrain) + Etrain = Eapprox + Etrain, where Eapprox = (Etest − Etrain).

• How does this help?
– If Eapprox is small, then Etrain is a good approximation to Etest.
• What does Eapprox ("amount of overfitting") depend on?
– It tends to get smaller as 'n' gets larger.
– It tends to grow as the model gets more "complicated".
Fundamental Trade-Off
• This leads to a fundamental trade-off:
1. Etrain: how small you can make the training error.
vs.
2. Eapprox: how well training error approximates the test error.
• Simple models (like decision stumps):
– Eapprox is low (not very sensitive to training set).
– But Etrain might be high.
• Complex models (like deep decision trees):
– Etrain can be low.
– But Eapprox might be high (very sensitive to training set).
Fundamental Trade-Off
• Training error vs. test error for choosing depth:
– Training error is high for low depth (underfitting).
– Training error gets better with depth.
– Test error initially goes down, but eventually increases (overfitting).
Validation Error
• How do we decide decision tree depth?
• We care about test error.
• But we can't look at test data.
• So what do we do?????
• One answer: use part of the training data to approximate test error.
• Split training examples into training set and validation set:
– Train model based on the training data.
– Test model based on the validation data.
Validation Error
Validation Error
• IID data: validation error is an unbiased approximation of test error.
• Midterm analogy:
– You have 2 practice midterms.
– You hide one midterm, and spend a lot of time working through the other.
– You then do the hidden practice midterm, to see how well you'll do on the test.
• We typically use validation error to choose "hyper-parameters"…
Notation: Parameters and Hyper-Parameters
• The decision tree rule values are called "parameters".
– Parameters control how well we fit a dataset.
– We "train" a model by trying to find the best parameters on training data.
• The decision tree depth is called a "hyper-parameter".
– Hyper-parameters control how complex our model is.
– We can't "train" a hyper-parameter.
• You can always fit the training data better by making the model more complicated.
– We "validate" a hyper-parameter using a validation score.
• ("Hyper-parameter" is sometimes used for parameters "not fit with data".)
Choosing Hyper-Parameters with Validation Set
• So to choose a good value of depth ("hyper-parameter"), we could:
– Try a depth-1 decision tree, compute validation error.
– Try a depth-2 decision tree, compute validation error.
– Try a depth-3 decision tree, compute validation error.
– …
– Try a depth-20 decision tree, compute validation error.
– Return the depth with the lowest validation error.
• After choosing the hyper-parameter, we usually re-train on the full training set with the chosen hyper-parameter (a sketch of the whole procedure follows below).
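A sketch of this procedure, assuming scikit-learn; the random X and y stand in for a real training set:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X, y = rng.random((200, 6)), rng.integers(0, 2, 200)

# Carve a validation set out of the training examples.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

best_depth, best_err = None, np.inf
for depth in range(1, 21):
    model = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train)
    err = np.mean(model.predict(X_valid) != y_valid)  # validation error
    if err < best_err:
        best_depth, best_err = depth, err

# Re-train on the full training set with the chosen hyper-parameter.
final_model = DecisionTreeClassifier(max_depth=best_depth).fit(X, y)
print(f"chosen depth: {best_depth}, validation error: {best_err:.3f}")
```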
Digression: Optimization Bias
• Another name for overfitting is "optimization bias":
– How biased is an "error" that we optimized over many possibilities?
• Optimization bias of parameter learning:
– During learning, we could search over tons of different decision trees.
– So we can get "lucky" and find one with low training error by chance.
• "Overfitting of the training error".
• Optimization bias of hyper-parameter tuning:
– Here, we might optimize the validation error over 20 values of "depth".
– One of the 20 trees might have low validation error by chance.
• "Overfitting of the validation error".
Digression: Example of Optimization Bias
• Consider a multiple-choice (a,b,c,d) "test" with 10 questions:
– If you choose answers randomly, expected grade is 25% (no bias).
– If you fill out two tests randomly and pick the best, expected grade is 33%.
• Optimization bias of ~8%.
– If you take the best among 10 random tests, expected grade is ~47%.
– If you take the best among 100, expected grade is ~62%.
– If you take the best among 1000, expected grade is ~73%.
– If you take the best among 10000, expected grade is ~82%.
• You have so many "chances" that you expect to do well.
• But on new questions the "random choice" accuracy is still 25%.
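These expected grades are easy to sanity-check with a small Monte Carlo simulation (a sketch, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
n_questions, trials = 10, 100_000

for k in [1, 2, 10, 100]:
    # Each random test scores Binomial(10, 0.25); keep the best of k attempts.
    grades = rng.binomial(n_questions, 0.25, size=(trials, k)).max(axis=1)
    print(f"best of {k:3d} random tests: expected grade ~ "
          f"{grades.mean() / n_questions:.0%}")
```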
Factors Affecting Optimization Bias
• If we instead used a 100-question test then:
– Expected grade from best over 1 randomly-filled test is 25%.
– Expected grade from best over 2 randomly-filled tests is ~27%.
– Expected grade from best over 10 randomly-filled tests is ~32%.
– Expected grade from best over 100 randomly-filled tests is ~36%.
– Expected grade from best over 1000 randomly-filled tests is ~40%.
– Expected grade from best over 10000 randomly-filled tests is ~47%.
• The optimization bias grows with the number of things we try.
– "Complexity" of the set of models we search over.
• But optimization bias shrinks quickly with the number of examples.
– But it's still non-zero and growing if you over-use your validation set!
Summary
• Training error vs. testing error:
– What we care about in machine learning is the testing error.
• Golden rule of machine learning:
– The test data cannot influence training the model in any way.
• Independent and identically distributed (IID):
– One assumption that makes learning possible.
• Fundamental trade-off:
– Trade-off between getting low training error and having training error approximate test error.
• Validation set:
– We can save part of our training data to approximate test error.
• Hyper-parameters:
– Parameters that control model complexity, typically set with a validation set.
• Next time:
– We discuss the "best" machine learning method.
"test error" vs. "test set error" vs. "validation error"
"test error" vs. "test set error" vs. "validation error"
Approximation Error for Selecting Hyper-Parameters
• From the 2019 EasyMarkit AI Hackathon:
– "We ended up selecting the hyperparameters that gave us the lowest approximation error (gap between train and validation) as opposed to the lowest validation error. This was quite a difficult decision for our team since we were only allowed one submission. However, the model with the lowest validation error had a very high approximation error, which felt too risky, so we went with a model with a slightly higher validation error and much lower approximation error. When the results were announced, the reported test accuracy was within 0.1% of what our model predicted with the validation set."
• This is the type of reasoning you want to do.
– A high approximation error could indicate low validation error by chance.
"A Visual Introduction to Machine Learning"
• The "housing prices" example is taken from this website:
– http://www.r2d3.us/visual-intro-to-machine-learning-part-1
• They also have a "Part 2" here:
– http://www.r2d3.us/visual-intro-to-machine-learning-part-2
• Part 2 covers similar topics to what we covered in this lecture.
Bounding Eapprox
• Let's assume we have a fixed model 'h' (like a decision tree), and then we collect a training set of 'n' examples.
• What is the probability that the error on this training set (Etrain) is within some small number ε of the test error (Etest)?
• From "Hoeffding's inequality" we have:
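The inequality itself appears only as an image on the slide. For an error bounded in [0, 1], the standard form of Hoeffding's inequality in this setting is:

$$P\big(|E_{\text{train}} - E_{\text{test}}| \geq \epsilon\big) \leq 2\exp\!\big(-2n\epsilon^2\big)$$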
• This is great! In this setting, the probability that our training error is far from our test error goes down exponentially in terms of the number of samples 'n'.
Bounding Eapprox
• Unfortunately, the last slide gets it backwards:
– We usually don't pick a model and then collect a dataset.
– We usually collect a dataset and then pick the model 'h' based on the data.
• We now picked the model that did best on the data, and Hoeffding's inequality doesn't account for the optimization bias of this procedure.
• One way to get around this is to bound (Etest − Etrain) for all models in the space of models we are optimizing over.
– If we bound it for all models, then we bound it for the best model.
– This gives looser but correct bounds.
Bounding Eapprox
• If we only optimize over a finite number 'k' of models, we can use the "union bound", which for events {A1, A2, …, Ak} says:
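The bound itself is an image on the slide; the standard union bound states:

$$P(A_1 \cup A_2 \cup \cdots \cup A_k) \leq \sum_{i=1}^{k} P(A_i)$$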
• Combining Hoeffding's inequality and the union bound gives:
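This formula is also an image on the slide; combining the two results over 'k' models presumably gives the standard form:

$$P\Big(\max_{h \in \{h_1,\dots,h_k\}} |E_{\text{train}}(h) - E_{\text{test}}(h)| \geq \epsilon\Big) \leq 2k\exp\!\big(-2n\epsilon^2\big)$$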
Bounding Eapprox
• So, with the optimization bias of setting h* to the best 'h' among 'k' models, the probability that (Etest − Etrain) is bigger than ε satisfies the combined bound above.
• So optimizing over a few models is OK if we have lots of examples.
• If we try lots of models, then (Etest − Etrain) could be very large.
• Later in the course we'll be searching over continuous models where k = infinity, so this bound is useless.
• To handle continuous models, one way is via the VC-dimension.
– Simpler models will have lower VC-dimension.
Refined Fundamental Trade-Off
• Let Ebest be the irreducible error (lowest possible error for any model).
• For example, the irreducible error for predicting coin flips is 0.5.
• Some learning theory results use Ebest to further decompose Etest:
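The decomposition is shown as an image on the slide; judging from the three terms described below, it presumably has the form:

$$E_{\text{test}} = \underbrace{(E_{\text{test}} - E_{\text{train}})}_{\text{Term 1}} + \underbrace{(E_{\text{train}} - E_{\text{best}})}_{\text{Term 2}} + \underbrace{E_{\text{best}}}_{\text{Term 3}}$$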
• This is similar to the bias-variance decomposition:
– Term 1: measure of variance (how sensitive we are to training data).
– Term 2: measure of bias (how low we can make the training error).
– Term 3: measure of noise (how low any model can make test error).
Refined Fundamental Trade-Off
• Decision tree with high depth:
– Very likely to fit the data well, so bias is low.
– But the model changes a lot if you change the data, so variance is high.
• Decision tree with low depth:
– Less likely to fit the data well, so bias is high.
– But the model doesn't change much if you change the data, so variance is low.
• And depth does not affect the irreducible error.
– Irreducible error comes from the best possible model.
Bias-Variance Decomposition
• You may have seen the "bias-variance decomposition" in other classes:
– Assumes $\tilde{y}_i = \bar{y}_i + \epsilon$, where $\epsilon$ has mean 0 and variance $\sigma^2$.
– Assumes we have a "learner" that can take 'n' training examples and use these to make predictions $\hat{y}_i$.
• Expected squared test error in this setting is:
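The formula is an image on the slide; the standard decomposition, matching the three terms below, is:

$$\mathbb{E}\big[(\hat{y}_i - \tilde{y}_i)^2\big] = \underbrace{\big(\mathbb{E}[\hat{y}_i] - \bar{y}_i\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat{y}_i - \mathbb{E}[\hat{y}_i])^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$$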
– Where expectations are taken over possible training sets of 'n' examples.
– Bias is the expected error due to having the wrong model.
– Variance is the expected error due to sensitivity to the training set.
– Noise (irreducible error) is the best we can hope for given the noise (Ebest).
Bias-Variance vs. Fundamental Trade-Off
• Both decompositions serve the same purpose:
– Trying to evaluate how different factors affect test error.
• They both lead to the same 3 conclusions:
1. Simple models can have high Etrain/bias, low Eapprox/variance.
2. Complex models can have low Etrain/bias, high Eapprox/variance.
3. As you increase 'n', Eapprox/variance goes down (for fixed complexity).
Bias-Variance vs. Fundamental Trade-Off
• So why focus on the fundamental trade-off and not bias-variance?
– It is the simplest viewpoint that gives these 3 conclusions.
– No assumptions like being restricted to squared error.
– You can measure Etrain but not Eapprox (1 known and 1 unknown).
• If Etrain is low and you expect Eapprox to be low, then you are happy.
– E.g., you fit a very simple model or you used a huge independent validation set.
– You can't measure bias, variance, or noise (3 unknowns).
• If Etrain is low, the bias-variance decomposition doesn't say anything about test error.
– You only have your training set, not a distribution over possible datasets.
– It doesn't say if high Etest is due to bias or variance or noise.
Learning Theory
• The bias-variance decomposition is a bit weird compared to our previous decompositions of Etest:
– The bias-variance decomposition considers an expectation over possible training sets.
– But it doesn't say anything about test error with your training set.
• Some keywords if you want to learn about learning theory:
– Bias-variance decomposition, sample complexity, probably approximately correct (PAC) learning, Vapnik-Chervonenkis (VC) dimension, Rademacher complexity.
• A gentle place to start is the "Learning from Data" book:
– https://work.caltech.edu/telecourse.html
A Theoretical Answer to "How Much Data?"
• Assume we have a source of IID examples and a fixed class of parametric models.
– Like "all depth-5 decision trees".
• Under some nasty assumptions, with 'n' training examples it holds that:

E[test error of best model on training set] − (best test error in class) = O(1/n).

• You rarely know the constant factor, but this gives some guidelines:
– Adding more data helps more on small datasets than on large datasets.
• Going from 10 training examples to 20, the difference with the best possible error gets cut in half.
– If the best possible error is 15%, you might go from 20% to 17.5% (this does not mean 20% to 10%).
• Going from 110 training examples to 120, the error only goes down by ~10%.
• Going from 1M training examples to 1M+10, you won't notice a change.
– Doubling the data size cuts the error in half:
• Going from 1M training to 2M training examples, the error gets cut in half.
• If you double the data size and your test error doesn't improve, more data might not help.
Can You Test the IID Assumption?
• In general, testing the IID assumption is not easy.
– Usually, you need background knowledge to decide if it's reasonable.
• Some tests do exist, like shuffling the order of the data and then measuring if some basic statistics agree.
– It's reasonable to check if summary statistics of the train and test data agree.
• If not, your trained model may not be so useful.
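A rough sanity check in this spirit: compare per-feature summary statistics of the train and test splits. Large gaps hint the data may not be IID (a sketch with random stand-in arrays):

```python
import numpy as np

rng = np.random.default_rng(4)
X_train, X_test = rng.random((200, 4)), rng.random((50, 4))

train_means, test_means = X_train.mean(axis=0), X_test.mean(axis=0)
train_stds = X_train.std(axis=0)

# Flag features whose test mean is far from the train mean (in train std units).
gap = np.abs(train_means - test_means) / (train_stds + 1e-12)
print("per-feature mean gap (std units):", np.round(gap, 2))
```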
• Some discussion here:
– https://stats.stackexchange.com/questions/28715/test-for-iid-sampling