CPSC 340: Machine Learning and Data Mining
Fundamentals of Learning (Fall 2019)
Admin
• Assignment 1 is due Wednesday: you should be almost done.
• Waiting list people: everyone should be in soon?
• Course webpage:
– https://www.cs.ubc.ca/~fwood/CS340/
• Auditors:
– Bring your forms at the end of class Friday, assuming we clear the waitlist.
• Exchange students:
– If you are still having trouble registering, bring your forms Friday.
– Contact us on Piazza about getting registered for Gradescope.
• Midterm confirmed (14/Feb, 6pm–8pm, Wesbrook 100).
Last Time: Supervised Learning Notation
• Feature matrix 'X' has rows as examples, columns as features.
– xij is feature 'j' for example 'i' (quantity of food 'j' on day 'i').
– xi is the list of all features for example 'i' (all the quantities on day 'i').
– xj is column 'j' of the matrix (the value of feature 'j' across all examples).
• Label vector 'y' contains the labels of the examples.
– yi is the label of example 'i' (1 for "sick", 0 for "not sick").

X (features):

| Egg | Milk | Fish | Wheat | Shellfish | Peanuts |
|-----|------|------|-------|-----------|---------|
| 0   | 0.7  | 0    | 0.3   | 0         | 0       |
| 0.3 | 0.7  | 0    | 0.6   | 0         | 0.01    |
| 0   | 0    | 0    | 0.8   | 0         | 0       |
| 0.3 | 0.7  | 1.2  | 0     | 0.10      | 0.01    |
| 0.3 | 0    | 1.2  | 0.3   | 0.10      | 0.01    |

y (labels):

| Sick? |
|-------|
| 1 |
| 1 |
| 0 |
| 1 |
| 1 |
Supervised Learning Application
• We motivated supervised learning by the "food allergy" example.
• But we can use supervised learning for any input:output mapping.
– E-mail spam filtering.
– Optical character recognition on scanners.
– Recognizing faces in pictures.
– Recognizing tumours in medical images.
– Speech recognition on phones.
– Your problem in industry/research?
Motivation: Determine Home City
• We are given data from 248 homes.
• For each home/example, we have these features:
– Elevation.
– Year.
– Bathrooms.
– Bedrooms.
– Price.
– Square feet.
• Goal is to build a program that predicts SF or NY.

This example and images of it come from: http://www.r2d3.us/visual-intro-to-machine-learning-part-1
Plotting Elevation
Simple Decision Stump
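The stump on this slide appears only as an image. As a minimal sketch (not the slide's actual rule), a decision stump thresholds a single feature; the feature choice and the threshold below are placeholders, not the values from the figure:

```python
def decision_stump(example):
    # Depth-1 rule on a single feature. The threshold (50) is a made-up
    # placeholder, not the value from the slide's figure.
    if example["elevation"] > 50:
        return "SF"  # higher-elevation homes predicted as San Francisco
    return "NY"      # everything else predicted as New York

print(decision_stump({"elevation": 73}))  # -> SF
```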
Scatterplot Array
Scatterplot Array
Plotting Elevation and Price/SqFt
Simple Decision Tree Classification
Simple Decision Tree Classification
How does the depth affect accuracy?
This is a good start (>75% accuracy).
How does the depth affect accuracy?
Start splitting the data recursively…
How does the depth affect accuracy?
Accuracy keeps increasing as we add depth.
How does the depth affect accuracy?
Eventually, we can perfectly classify all of our data.
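A quick way to see this trend is to fit trees of increasing depth and print the training accuracy. A sketch assuming scikit-learn; the synthetic X and y below merely stand in for the 248-home dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the homes data so the sketch runs on its own.
rng = np.random.default_rng(0)
X = rng.random((248, 6))                                          # 6 features per home
y = (X[:, 0] + 0.3 * rng.standard_normal(248) > 0.5).astype(int)  # noisy labels

for depth in [1, 2, 5, 10, 20]:
    model = DecisionTreeClassifier(max_depth=depth).fit(X, y)
    # score() on the fitting data is the training accuracy; it approaches 1.
    print(f"depth={depth:2d}  training accuracy={model.score(X, y):.3f}")
```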
Training vs. Testing Error
• With this decision tree, 'training accuracy' is 1.
– It perfectly labels the data we used to make the tree.
• We are now given features for 217 new homes.
• What is the 'testing accuracy' on the new data?
– How does it do on data not used to make the tree?
• Overfitting: lower accuracy on new data.
– Our rules got too specific to our exact training dataset.
– Some of the "deep" splits only use a few examples (bad "coupon collecting").
Supervised Learning Notation
• We are given training data where we know labels:

X =

| Egg | Milk | Fish | Wheat | Shellfish | Peanuts | … |
|-----|------|------|-------|-----------|---------|---|
| 0   | 0.7  | 0    | 0.3   | 0         | 0       |   |
| 0.3 | 0.7  | 0    | 0.6   | 0         | 0.01    |   |
| 0   | 0    | 0    | 0.8   | 0         | 0       |   |
| 0.3 | 0.7  | 1.2  | 0     | 0.10      | 0.01    |   |
| 0.3 | 0    | 1.2  | 0.3   | 0.10      | 0.01    |   |

y =

| Sick? |
|-------|
| 1 |
| 1 |
| 0 |
| 1 |
| 1 |

• But there is also testing data we want to label:

$\tilde{X}$ =

| Egg | Milk | Fish | Wheat | Shellfish | Peanuts | … |
|-----|------|------|-------|-----------|---------|---|
| 0.5 | 0    | 1    | 0.6   | 2         | 1       |   |
| 0   | 0.7  | 0    | 1     | 0         | 0       |   |
| 3   | 1    | 0    | 0.5   | 0         | 0       |   |

$\tilde{y}$ =

| Sick? |
|-------|
| ? |
| ? |
| ? |
Supervised Learning Notation
• Typical supervised learning steps:
1. Build model based on training data X and y (training phase).
2. Model makes predictions $\hat{y}$ on test data $\tilde{X}$ (testing phase).
• Instead of training error, consider test error:
– Are predictions $\hat{y}$ similar to the true unseen labels $\tilde{y}$?
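A minimal sketch of the two phases, assuming scikit-learn; the random arrays stand in for real training and test data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X, y = rng.random((100, 6)), rng.integers(0, 2, 100)          # training data
X_test, y_test = rng.random((30, 6)), rng.integers(0, 2, 30)  # test data

model = DecisionTreeClassifier(max_depth=3).fit(X, y)  # training phase
y_hat = model.predict(X_test)                          # testing phase
print(f"test error: {np.mean(y_hat != y_test):.3f}")   # fraction mislabeled
```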
Goal of Machine Learning
• In machine learning:
– What we care about is the test error!
• Midterm analogy:
– The training error is the practice midterm.
– The test error is the actual midterm.
– Goal: do well on the actual midterm, not the practice one.
• Memorization vs learning:
– Can do well on training data by memorizing it.
– You've only learned if you can do well in new situations.
Golden Rule of Machine Learning
• Even though what we care about is test error:
– THE TEST DATA CANNOT INFLUENCE THE TRAINING PHASE IN ANY WAY.
• We're measuring test error to see how well we do on new data:
– If used during training, it doesn't measure this.
– You can start to overfit if you use it during training.
– Midterm analogy: you are cheating on the test.
Golden Rule of Machine Learning
• Even though what we care about is test error:
– THE TEST DATA CANNOT INFLUENCE THE TRAINING PHASE IN ANY WAY.

http://www.technologyreview.com/view/538111/why-and-how-baidu-cheated-an-artificial-intelligence-test/
Golden Rule of Machine Learning
• Even though what we care about is test error:
– THE TEST DATA CANNOT INFLUENCE THE TRAINING PHASE IN ANY WAY.
• You also shouldn't change the test set to get the result you want.
– http://blogs.sciencemag.org/pipeline/archives/2015/01/14/the_dukepotti_scandal_from_the_inside
– https://www.cbsnews.com/news/deception-at-duke-fraud-in-cancer-care/
Digression: Golden Rule and Hypothesis Testing
• Note the golden rule applies to hypothesis testing in scientific studies.
– Data that you collect can't influence the hypotheses that you test.
• EXTREMELY COMMON and a MAJOR PROBLEM, coming in many forms:
– Collect more data until you coincidentally get the significance level you want.
– Try different ways to measure performance, choose the one that looks best.
– Choose a different type of model/hypothesis after looking at the test data.
• If you want to modify your hypotheses, you need to test on new data.
– Or at least be aware and honest about this issue when reporting results.
Digression: Golden Rule and Hypothesis Testing
• Note the golden rule applies to hypothesis testing in scientific studies.
– Data that you collect can't influence the hypotheses that you test.
• EXTREMELY COMMON and a MAJOR PROBLEM, coming in many forms:
– "Replication crisis in Science".
– "Why Most Published Research Findings are False".
– "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant".
– "HARKing: Hypothesizing After the Results are Known".
– "Hack Your Way To Scientific Glory".
– "Psychology's Replication Crisis Has Made The Field Better" (some solutions).
Is Learning Possible?
• Does training error say anything about test error?
– In general, NO: test data might have nothing to do with training data.
– E.g., an "adversary" takes the training data and flips all labels:

X =

| Egg | Milk | Fish |
|-----|------|------|
| 0   | 0.7  | 0    |
| 0.3 | 0.7  | 1    |
| 0.3 | 0    | 0    |

y =

| Sick? |
|-------|
| 1 |
| 1 |
| 0 |

$\tilde{X}$ =

| Egg | Milk | Fish |
|-----|------|------|
| 0   | 0.7  | 0    |
| 0.3 | 0.7  | 1    |
| 0.3 | 0    | 0    |

$\tilde{y}$ =

| Sick? |
|-------|
| 0 |
| 0 |
| 1 |

• In order to learn, we need assumptions:
– The training and test data need to be related in some way.
– Most common assumption: independent and identically distributed (IID).
IID Assumption
• Training/test data is independent and identically distributed (IID) if:
– All examples come from the same distribution (identically distributed).
– The examples are sampled independently (order doesn't matter).
• Examples in terms of cards:
– Pick a card, put it back in the deck, re-shuffle, repeat.
– Pick a card, put it back in the deck, repeat.
– Pick a card, don't put it back, re-shuffle, repeat.

| Age | Job? | City | Rating | Income    |
|-----|------|------|--------|-----------|
| 23  | Yes  | Van  | A      | 22,000.00 |
| 23  | Yes  | Bur  | BBB    | 21,000.00 |
| 22  | No   | Van  | CC     | 0.00      |
| 25  | Yes  | Sur  | AAA    | 57,000.00 |
IID Assumption and Food Allergy Example
• Is the food allergy data IID?
– Do all the examples come from the same distribution?
– Does the order of the examples matter?
• No!
– Being sick might depend on what you ate yesterday (not independent).
– Your eating habits might change over time (not identically distributed).
• What can we do about this?
– Just ignore that the data isn't IID and hope for the best?
– For each day, maybe add the features from the previous day? (A sketch follows below.)
– Maybe add time as an extra feature?
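A sketch of the "previous day" idea, assuming the features are stored as a NumPy array with one row per day (the numbers are toy values):

```python
import numpy as np

X = np.array([[0.0, 0.7, 0.0],
              [0.3, 0.7, 0.0],
              [0.0, 0.0, 0.0]])  # toy food quantities for 3 days

# Shift the rows down by one so row i holds day i-1's foods (zeros for day 0).
prev_day = np.vstack([np.zeros(X.shape[1]), X[:-1]])
X_augmented = np.hstack([X, prev_day])  # today's foods + yesterday's foods
print(X_augmented.shape)                # (3, 6)
```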
Learning Theory
• Why does the IID assumption make learning possible?
– Patterns in training examples are likely to be the same in test examples.
• The IID assumption is rarely true:
– But it is often a good approximation.
– There are other possible assumptions.
• Also, we're assuming IID across examples but not across features.
• Learning theory explores how training error is related to test error.
• We'll look at a simple example, using this notation:
– Etrain is the error on training data.
– Etest is the error on testing data.
Fundamental Trade-Off
• Start with Etest = Etest, then add and subtract Etrain on the right:

Etest = (Etest − Etrain) + Etrain = Eapprox + Etrain, where Eapprox = (Etest − Etrain).

• How does this help?
– If Eapprox is small, then Etrain is a good approximation to Etest.
• What does Eapprox ("amount of overfitting") depend on?
– It tends to get smaller as 'n' gets larger.
– It tends to grow as the model gets more "complicated".
Fundamental Trade-Off
• This leads to a fundamental trade-off:
1. Etrain: how small you can make the training error.
vs.
2. Eapprox: how well training error approximates the test error.
• Simple models (like decision stumps):
– Eapprox is low (not very sensitive to training set).
– But Etrain might be high.
• Complex models (like deep decision trees):
– Etrain can be low.
– But Eapprox might be high (very sensitive to training set).
Fundamental Trade-Off
• Training error vs. test error for choosing depth:
– Training error is high for low depth (underfitting).
– Training error gets better with depth.
– Test error initially goes down, but eventually increases (overfitting).
Validation Error
• How do we decide decision tree depth?
• We care about test error.
• But we can't look at test data.
• So what do we do?????
• One answer: use part of the training data to approximate test error.
• Split training examples into training set and validation set:
– Train model based on the training data.
– Test model based on the validation data.
Validation Error
Validation Error
• IID data: validation error is an unbiased approximation of test error.
• Midterm analogy:
– You have 2 practice midterms.
– You hide one midterm, and spend a lot of time working through the other.
– You then do the hidden practice midterm, to see how well you'll do on the test.
• We typically use validation error to choose "hyper-parameters"…
Notation: Parameters and Hyper-Parameters
• The decision tree rule values are called "parameters".
– Parameters control how well we fit a dataset.
– We "train" a model by trying to find the best parameters on training data.
• The decision tree depth is called a "hyper-parameter".
– Hyper-parameters control how complex our model is.
– We can't "train" a hyper-parameter.
• You can always fit the training data better by making the model more complicated.
– We "validate" a hyper-parameter using a validation score.
• ("Hyper-parameter" is sometimes used for parameters "not fit with data".)
Choosing Hyper-Parameters with Validation Set
• So to choose a good value of depth ("hyper-parameter"), we could:
– Try a depth-1 decision tree, compute validation error.
– Try a depth-2 decision tree, compute validation error.
– Try a depth-3 decision tree, compute validation error.
– …
– Try a depth-20 decision tree, compute validation error.
– Return the depth with the lowest validation error.
• After choosing the hyper-parameter, we usually re-train on the full training set with the chosen hyper-parameter (a sketch of the whole procedure follows below).
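A sketch of this procedure, assuming scikit-learn; the random X and y stand in for a real training set:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X, y = rng.random((200, 6)), rng.integers(0, 2, 200)

# Carve a validation set out of the training examples.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

best_depth, best_err = None, np.inf
for depth in range(1, 21):
    model = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train)
    err = np.mean(model.predict(X_valid) != y_valid)  # validation error
    if err < best_err:
        best_depth, best_err = depth, err

# Re-train on the full training set with the chosen hyper-parameter.
final_model = DecisionTreeClassifier(max_depth=best_depth).fit(X, y)
print(f"chosen depth: {best_depth}, validation error: {best_err:.3f}")
```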
Digression: Optimization Bias
• Another name for overfitting is "optimization bias":
– How biased is an "error" that we optimized over many possibilities?
• Optimization bias of parameter learning:
– During learning, we could search over tons of different decision trees.
– So we can get "lucky" and find one with low training error by chance.
• "Overfitting of the training error".
• Optimization bias of hyper-parameter tuning:
– Here, we might optimize the validation error over 20 values of "depth".
– One of the 20 trees might have low validation error by chance.
• "Overfitting of the validation error".
Digression: Example of Optimization Bias
• Consider a multiple-choice (a,b,c,d) "test" with 10 questions:
– If you choose answers randomly, expected grade is 25% (no bias).
– If you fill out two tests randomly and pick the best, expected grade is 33%.
• Optimization bias of ~8%.
– If you take the best among 10 random tests, expected grade is ~47%.
– If you take the best among 100, expected grade is ~62%.
– If you take the best among 1000, expected grade is ~73%.
– If you take the best among 10000, expected grade is ~82%.
• You have so many "chances" that you expect to do well.
• But on new questions the "random choice" accuracy is still 25%.
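These expected grades are easy to sanity-check with a small Monte Carlo simulation (a sketch, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
n_questions, trials = 10, 100_000

for k in [1, 2, 10, 100]:
    # Each random test scores Binomial(10, 0.25); keep the best of k attempts.
    grades = rng.binomial(n_questions, 0.25, size=(trials, k)).max(axis=1)
    print(f"best of {k:3d} random tests: expected grade ~ "
          f"{grades.mean() / n_questions:.0%}")
```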
Factors Affecting Optimization Bias
• If we instead used a 100-question test then:
– Expected grade from best over 1 randomly-filled test is 25%.
– Expected grade from best over 2 randomly-filled tests is ~27%.
– Expected grade from best over 10 randomly-filled tests is ~32%.
– Expected grade from best over 100 randomly-filled tests is ~36%.
– Expected grade from best over 1000 randomly-filled tests is ~40%.
– Expected grade from best over 10000 randomly-filled tests is ~47%.
• The optimization bias grows with the number of things we try.
– "Complexity" of the set of models we search over.
• But optimization bias shrinks quickly with the number of examples.
– But it's still non-zero and growing if you over-use your validation set!
Summary
• Training error vs. testing error:
– What we care about in machine learning is the testing error.
• Golden rule of machine learning:
– The test data cannot influence training the model in any way.
• Independent and identically distributed (IID):
– One assumption that makes learning possible.
• Fundamental trade-off:
– Trade-off between getting low training error and having training error approximate test error.
• Validation set:
– We can save part of our training data to approximate test error.
• Hyper-parameters:
– Parameters that control model complexity, typically set with a validation set.
• Next time:
– We discuss the "best" machine learning method.
"test error" vs. "test set error" vs. "validation error"
"test error" vs. "test set error" vs. "validation error"
Approximation Error for Selecting Hyper-Parameters
• From the 2019 EasyMarkit AI Hackathon:
– "We ended up selecting the hyperparameters that gave us the lowest approximation error (gap between train and validation) as opposed to the lowest validation error. This was quite a difficult decision for our team since we were only allowed one submission. However, the model with the lowest validation error had a very high approximation error, which felt too risky, so we went with a model with a slightly higher validation error and much lower approximation error. When the results were announced, the reported test accuracy was within 0.1% of what our model predicted with the validation set."
• This is the type of reasoning you want to do.
– A high approximation error could indicate low validation error by chance.
"A Visual Introduction to Machine Learning"
• The "housing prices" example is taken from this website:
– http://www.r2d3.us/visual-intro-to-machine-learning-part-1
• They also have a "Part 2" here:
– http://www.r2d3.us/visual-intro-to-machine-learning-part-2
• Part 2 covers similar topics to what we covered in this lecture.
Bounding Eapprox
• Let's assume we have a fixed model 'h' (like a decision tree), and then we collect a training set of 'n' examples.
• What is the probability that the error on this training set (Etrain) is within some small number ε of the test error (Etest)?
• From "Hoeffding's inequality" we have:
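The inequality itself appears only as an image on the slide. For an error bounded in [0, 1], the standard form of Hoeffding's inequality in this setting is:

$$P\big(|E_{\text{train}} - E_{\text{test}}| \geq \epsilon\big) \leq 2\exp\!\big(-2n\epsilon^2\big)$$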
• This is great! In this setting, the probability that our training error is far from our test error goes down exponentially in terms of the number of samples 'n'.
Bounding Eapprox
• Unfortunately, the last slide gets it backwards:
– We usually don't pick a model and then collect a dataset.
– We usually collect a dataset and then pick the model 'h' based on the data.
• We now picked the model that did best on the data, and Hoeffding's inequality doesn't account for the optimization bias of this procedure.
• One way to get around this is to bound (Etest − Etrain) for all models in the space of models we are optimizing over.
– If we bound it for all models, then we bound it for the best model.
– This gives looser but correct bounds.
Bounding Eapprox
• If we only optimize over a finite number 'k' of models, we can use the "union bound", which for events {A1, A2, …, Ak} says:
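The bound itself is an image on the slide; the standard union bound states:

$$P(A_1 \cup A_2 \cup \cdots \cup A_k) \leq \sum_{i=1}^{k} P(A_i)$$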
• Combining Hoeffding's inequality and the union bound gives:
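This formula is also an image on the slide; combining the two results over 'k' models presumably gives the standard form:

$$P\Big(\max_{h \in \{h_1,\dots,h_k\}} |E_{\text{train}}(h) - E_{\text{test}}(h)| \geq \epsilon\Big) \leq 2k\exp\!\big(-2n\epsilon^2\big)$$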
Bounding Eapprox
• So, with the optimization bias of setting h* to the best 'h' among 'k' models, the probability that (Etest − Etrain) is bigger than ε satisfies the combined bound above.
• So optimizing over a few models is OK if we have lots of examples.
• If we try lots of models, then (Etest − Etrain) could be very large.
• Later in the course we'll be searching over continuous models where k = infinity, so this bound is useless.
• To handle continuous models, one way is via the VC-dimension.
– Simpler models will have lower VC-dimension.
Refined Fundamental Trade-Off
• Let Ebest be the irreducible error (lowest possible error for any model).
• For example, the irreducible error for predicting coin flips is 0.5.
• Some learning theory results use Ebest to further decompose Etest:
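The decomposition is shown as an image on the slide; judging from the three terms described below, it presumably has the form:

$$E_{\text{test}} = \underbrace{(E_{\text{test}} - E_{\text{train}})}_{\text{Term 1}} + \underbrace{(E_{\text{train}} - E_{\text{best}})}_{\text{Term 2}} + \underbrace{E_{\text{best}}}_{\text{Term 3}}$$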
• This is similar to the bias-variance decomposition:
– Term 1: measure of variance (how sensitive we are to training data).
– Term 2: measure of bias (how low we can make the training error).
– Term 3: measure of noise (how low any model can make test error).
Refined Fundamental Trade-Off
• Decision tree with high depth:
– Very likely to fit the data well, so bias is low.
– But the model changes a lot if you change the data, so variance is high.
• Decision tree with low depth:
– Less likely to fit the data well, so bias is high.
– But the model doesn't change much if you change the data, so variance is low.
• And depth does not affect the irreducible error.
– Irreducible error comes from the best possible model.
Bias-Variance Decomposition
• You may have seen the "bias-variance decomposition" in other classes:
– Assumes $\tilde{y}_i = \bar{y}_i + \epsilon$, where $\epsilon$ has mean 0 and variance $\sigma^2$.
– Assumes we have a "learner" that can take 'n' training examples and use these to make predictions $\hat{y}_i$.
• Expected squared test error in this setting is:
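The formula is an image on the slide; the standard decomposition, matching the three terms below, is:

$$\mathbb{E}\big[(\hat{y}_i - \tilde{y}_i)^2\big] = \underbrace{\big(\mathbb{E}[\hat{y}_i] - \bar{y}_i\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat{y}_i - \mathbb{E}[\hat{y}_i])^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$$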
– Where expectations are taken over possible training sets of 'n' examples.
– Bias is the expected error due to having the wrong model.
– Variance is the expected error due to sensitivity to the training set.
– Noise (irreducible error) is the best we can hope for given the noise (Ebest).
Bias-Variance vs. Fundamental Trade-Off
• Both decompositions serve the same purpose:
– Trying to evaluate how different factors affect test error.
• They both lead to the same 3 conclusions:
1. Simple models can have high Etrain/bias, low Eapprox/variance.
2. Complex models can have low Etrain/bias, high Eapprox/variance.
3. As you increase 'n', Eapprox/variance goes down (for fixed complexity).
Bias-Variance vs. Fundamental Trade-Off
• So why focus on the fundamental trade-off and not bias-variance?
– It is the simplest viewpoint that gives these 3 conclusions.
– No assumptions like being restricted to squared error.
– You can measure Etrain but not Eapprox (1 known and 1 unknown).
• If Etrain is low and you expect Eapprox to be low, then you are happy.
– E.g., you fit a very simple model or you used a huge independent validation set.
– You can't measure bias, variance, or noise (3 unknowns).
• If Etrain is low, the bias-variance decomposition doesn't say anything about test error.
– You only have your training set, not a distribution over possible datasets.
– It doesn't say if high Etest is due to bias or variance or noise.
Learning Theory
• The bias-variance decomposition is a bit weird compared to our previous decompositions of Etest:
– The bias-variance decomposition considers an expectation over possible training sets.
– But it doesn't say anything about test error with your training set.
• Some keywords if you want to learn about learning theory:
– Bias-variance decomposition, sample complexity, probably approximately correct (PAC) learning, Vapnik-Chervonenkis (VC) dimension, Rademacher complexity.
• A gentle place to start is the "Learning from Data" book:
– https://work.caltech.edu/telecourse.html
A Theoretical Answer to "How Much Data?"
• Assume we have a source of IID examples and a fixed class of parametric models.
– Like "all depth-5 decision trees".
• Under some nasty assumptions, with 'n' training examples it holds that:

E[test error of best model on training set] − (best test error in class) = O(1/n).

• You rarely know the constant factor, but this gives some guidelines:
– Adding more data helps more on small datasets than on large datasets.
• Going from 10 training examples to 20, the difference with the best possible error gets cut in half.
– If the best possible error is 15%, you might go from 20% to 17.5% (this does not mean 20% to 10%).
• Going from 110 training examples to 120, the error only goes down by ~10%.
• Going from 1M training examples to 1M+10, you won't notice a change.
– Doubling the data size cuts the error in half:
• Going from 1M training to 2M training examples, the error gets cut in half.
• If you double the data size and your test error doesn't improve, more data might not help.
Can You Test the IID Assumption?
• In general, testing the IID assumption is not easy.
– Usually, you need background knowledge to decide if it's reasonable.
• Some tests do exist, like shuffling the order of the data and then measuring if some basic statistics agree.
– It's reasonable to check if summary statistics of the train and test data agree.
• If not, your trained model may not be so useful.
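A rough sanity check in this spirit: compare per-feature summary statistics of the train and test splits. Large gaps hint the data may not be IID (a sketch with random stand-in arrays):

```python
import numpy as np

rng = np.random.default_rng(4)
X_train, X_test = rng.random((200, 4)), rng.random((50, 4))

train_means, test_means = X_train.mean(axis=0), X_test.mean(axis=0)
train_stds = X_train.std(axis=0)

# Flag features whose test mean is far from the train mean (in train std units).
gap = np.abs(train_means - test_means) / (train_stds + 1e-12)
print("per-feature mean gap (std units):", np.round(gap, 2))
```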
• Some discussion here:
– https://stats.stackexchange.com/questions/28715/test-for-iid-sampling