Practical Automated Machine Learning on Azure
Using Azure Machine Learning to Quickly Build AI Solutions
Deepak Mukunthu, Parashar Shah, and Wee Hyong Tok
Practical Automated Machine Learning on Azure by Deepak Mukunthu, Parashar Shah, and Wee Hyong Tok
Copyright © 2019 Deepak Mukunthu, Parashar Shah, and Wee Hyong Tok. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: [email protected].
Acquisitions Editor: Jonathan Hassell
Development Editor: Nicole Taché
Production Editor: Deborah Baker
Copyeditor: Octal Publishing, LLC
Proofreader: Sharon Wilkey
Indexer: Judith McConville
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
September 2019: First Edition
Revision History for the First Edition
2019-09-20: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492055594 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Practical Automated Machine Learning on Azure, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-492-05559-4
[LSI]
Dedications
Dedicated to my wife, kids, and parents for their unconditional love, encouragement and support in everything I do. - Deepak
Dedicated to the wonderful individuals in my life: Juliet, Nathaniel, and Jayden. My gratitude and love for them is infinite. - Wee Hyong
I would like to thank my parents Nita and Mahendra and my sister Vidhi for their unconditional love and encouragement throughout my life. I am thankful to my buddies at Microsoft (Priya, Premal, Vicky, Martha, Savita, Deepti, and Sagar) and my buddies outside of Microsoft (Kevin, Ritu, Dhaval, Shamit, Priyadarshan, Pradip, and Nikhil) for their loving friendship. - Parashar
Foreword
I vividly remember my first undergraduate class in artificial intelligence (AI). My father had worked for years on “expert systems,” and I was at MIT to learn from the best how to perform this wizardry. Marvin Minsky, one of the founders of the field, even taught a series of guest lectures there. It was about midway through the semester when the great disillusionment hit me: “It’s all just a bunch of tricks!” There was no “intelligence” to be found; just a bunch of brittle rules engines and clever use of math. This was in the early ’90s and the start of my own personal AI winter, when I dismissed AI as not having much use.
Years later, while I was working on advertising systems, I finally saw that there was power in this “bunch of tricks.” Algorithms that had been hand-tuned for months by talented engineers were being beaten by simple models provided with lots of data. I saw that the explosion that was to come simply needed more data and more computation to be effective. Over the past 5 to 10 years, the explosion in both big data and computation power has unleashed an industry that has had lots of starts and stops to it.
This time is different. While the hype about AI is still tremendously high, the potential applications of practical AI have really just begun to hit the business world. The rules or people making predictions today will be replaced virtually every place by AI algorithms. The value AI creates for businesses is tremendous, from being better able to value the oil available in an oil field to better predicting the inventory a store should stock of each new sneaker. Even marginal improvements in these capabilities represent billions of dollars of value across businesses.
We’re now in an age of AI implementation. Companies are working to find all the best places to deploy AI in their enterprises. One of the biggest challenges is matching the hype to reality. Half the companies I’ve talked to expect AI to perform some kind of magic for problems they have no idea how to solve. The other half are underestimating the power that AI can have. What they need are people with enough background in AI to help them conceive of what is possible and apply it to their business problems.
Customers I talk to are struggling to find enough people with those skills. While they have lots of developers and data analysts who are skilled and comfortable making predictions and decisions with data, they need data scientists who can then build the model from that data. This book will help fill that gap.
It shows how automated ML can empower developers and data analysts to train AI models. It highlights a number of business cases where AI is a great fit to the business problem and shows exactly how to build that model and put it into production. The technology and ideas in this book have been pressure-tested at scale with teams all across Microsoft, including Bing, Office, Azure Security, internal IT, and many more. They have also been used by many external businesses using Azure Machine Learning.
Eric Boyd
Microsoft Corporate Vice President, Azure AI
September 2019
Preface
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
TIP
This element signifies a tip or suggestion.
NOTE
This element signifies a general note.
WARNING
This element indicates a warning or caution.
Using Code Examples
Supplemental material (code examples, exercises, etc.) is available for download at https://oreil.ly/Practical_Automated_ML_on_Azure.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Practical Automated Machine Learning on Azure by Deepak Mukunthu, Parashar Shah, and Wee Hyong Tok (O’Reilly). Copyright 2019 Deepak Mukunthu, Parashar Shah, and Wee Hyong Tok, 978-1-492-05559-4.”
If you feel your use of code examples falls outside fair use or the permission given above, contact us at [email protected].
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
The web page for this book lists errata, examples, and additional information. You can access this page at http://www.oreilly.com/catalog/9781492055594.
To comment or ask technical questions about this book, send email to [email protected].
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
This book wouldn’t have been possible without great contributions from these folks. Thank you!
We are thankful to our coworkers at Microsoft (Azure AI product, marketing, and many other teams) for working together to deliver the best enterprise-ready Azure Machine Learning service.
Nicolo Fusi, for sharing details on research that led to the creation of Automated ML (Chapter 2).
Sharon Gillett, for text inputs to Automated ML introduction (Chapter 2).
Vanessa Milan, for images for Automated ML introduction (Chapter 2).
Akchara Mukunthu, for example scenarios for Machine Learning task detection (Table 2-1 in Chapter 2).
Krishna Anumalasetty and Thomas Abraham, for technical review of the book.
Jen Stirrup, for feedback on the book.
The amazing O’Reilly team (Nicole Tache, Deborah Baker, Bob Russell, Jonathan Hassell, Ben Lorica, and many more), for working with us from concept to production and giving us the opportunity to write and share the book with the community.
Members of the Azure Machine Learning and Azure CAT team, for the supportive environment that enabled the authors to write the book during their off hours, and many weekends and holidays.
Part I. Automated Machine Learning
In this part, you will learn how Automated Machine Learning can help automate model development.
Chapter 1. Machine Learning: Overview and Best Practices
How are humans different from machines? There are quite a few differences, but here’s an important one: humans learn from experience, whereas machines follow instructions given to them. What if machines can also learn from experience? That is the crux of machine learning. For machines, “data from the past” is the logical equivalent of “experience.” Machine learning combines statistics and computer science to enable machines to learn how to perform a given task without being explicitly programmed to do so via instructions.
Machine learning is widely used today, and we interact with it every day. Here are a few examples to illustrate:
Search engines like Bing or Google
Product recommendations at online stores like Amazon or eBay
Personalized video recommendations at Netflix or YouTube
Voice-based digital assistants like Alexa or Cortana
Spam filters for our email inbox
Credit card fraud detection
Why is machine learning as a trend emerging so fast? Why is everyone so interested in it now? As shown in Figure 1-1, its popularity arises from three key trends: big data, better/cheaper compute, and smarter algorithms.
Figure 1-1. Machine learning growth
In this chapter, we provide a quick refresher on machine learning by using a real-world example, discuss some of the best practices that differentiate successful machine learning projects from the rest, and end with challenges around productivity and scale.
Machine Learning: A Quick Refresher
What does the process of building a machine learning model look like? Let’s dig deeper using a real scenario: house price prediction. We have past home sales data, and the task is to predict the sale price for a given house that just came onto the market and isn’t currently in our dataset.
For simplicity, let’s assume that the size of the house (in square feet) is the most important input attribute (or feature) that determines house value. As shown in Table 1-1, we have data from four houses, A, B, C, and D, and we need to predict the price of house X.
Table 1-1. House prices based on size
House   Size (sq. ft)   Price ($)
A       1300            500,000
B       2000            800,000
C       2500            950,000
D       3200            1,200,000
X       1800            ?
We begin by plotting Size on the x-axis and Price on the y-axis, as shown in Figure 1-2.
Figure 1-2. Plotting price versus size
What’s the best estimate for the price of house X?
$550,000
$700,000
$1,000,000
Let’s figure it out. As shown in Figure 1-3, the four points that we plotted based on the data form an almost straight line. If we draw the line that best fits our data, we can find the point on the line associated with house X on the x-axis and the corresponding point on the y-axis, which will give us our price estimate.
Figure 1-3. Creating a straight line to find price estimate
In this case, that straight line represents our model and demonstrates a linear relationship. Linear regression is a statistical approach for modeling a linear relationship between input variables (also called feature, or independent, variables) and an output variable (also called a target, or dependent, variable). Mathematically, this linear relationship can be represented as follows:
$$y = \beta_0 + \beta_1 x$$
where:
y is the output variable; for example, the house price.
x is the input variable; for example, size in square feet.
β0 is the intercept (the value of y when x = 0).
β1 is the coefficient for x and the slope of the regression line (the average increase in y associated with a one-unit increase in x).
Model Parameters
β0 and β1 are known as the model parameters of this linear regression model.
When implementing linear regression, the algorithm finds the line of best fit by choosing the model parameters β0 and β1 such that the line is as close as possible to the actual data points (minimizing the sum of the squared distances between each actual data point and the line representing model predictions).
Figure 1-4 shows this conceptually. Dots represent actual data points, and the line represents the model predictions. d1 to d9 represent distances between data points and the corresponding model predictions, and D is the sum of their squares. The line shown in the figure is the best-fit regression line that minimizes D.
Figure 1-4. Regression
As you can see, model parameters are an integral part of the model and determine the outcome. Their values are learned from data through the model training process.
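To make the idea concrete, here is a minimal sketch in plain Python that learns β0 and β1 from the four houses in Table 1-1 using the standard closed-form least-squares solution, then estimates the price of house X. The function and variable names are illustrative assumptions, not code from the book.

```python
# Table 1-1 data: size in square feet and sale price in dollars.
sizes = [1300, 2000, 2500, 3200]
prices = [500_000, 800_000, 950_000, 1_200_000]


def fit_line(xs, ys):
    """Return (beta0, beta1) that minimize the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = s_xy / s_xx               # slope of the regression line
    beta0 = mean_y - beta1 * mean_x   # intercept
    return beta0, beta1


beta0, beta1 = fit_line(sizes, prices)

# D: sum of squared distances between actual prices and the fitted line.
D = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(sizes, prices))

# Price estimate for house X (1,800 sq. ft).
price_x = beta0 + beta1 * 1800
print(f"beta0={beta0:,.0f}  beta1={beta1:.2f}  estimate for X=${price_x:,.0f}")
```

With these four data points the estimate lands just under $700,000, which matches the middle option in the earlier question.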
Hyperparameters
There is another set of parameters known as hyperparameters. Model hyperparameters are used during the model training process to establish the correct values of model parameters. They are external to the model, and their values cannot be estimated from data. The choice of hyperparameters affects the duration of training and the accuracy of the predictions. As part of the model training process, data scientists usually specify hyperparameters based on heuristics or knowledge, and often tune them manually. Hyperparameter tuning relies more on experimental results than theory, and thus the best method to determine the optimal settings is to try many combinations and evaluate the performance of each model.
Simple linear regression doesn’t have any hyperparameters. But variants of linear regression, like Ridge regression and Lasso, do. Here are some examples of model hyperparameters for various machine learning algorithms:
The k in k-nearest neighbors
The desired depth and number of leaves in a decision tree
The C and sigma in support vector machines (SVMs)
The learning rate for training a neural network
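As an illustration of how a hyperparameter is chosen before training and still changes the outcome, the following sketch implements a tiny k-nearest neighbors regressor in plain Python on the Table 1-1 data. The helper name knn_predict is a hypothetical choice for this example, and averaging the neighbors' prices is one common variant rather than the only one.

```python
# Table 1-1 data as (size in sq. ft, price in dollars).
houses = [(1300, 500_000), (2000, 800_000), (2500, 950_000), (3200, 1_200_000)]


def knn_predict(train, size, k):
    """Average the prices of the k houses whose size is closest to `size`."""
    nearest = sorted(train, key=lambda house: abs(house[0] - size))[:k]
    return sum(price for _, price in nearest) / k


# k is a hyperparameter: we choose it up front; it is not learned from data.
predictions = {k: knn_predict(houses, 1800, k) for k in (1, 2, 3, 4)}
for k, price in predictions.items():
    print(f"k={k}: estimated price for house X = ${price:,.0f}")
```

The same data yields a different estimate for house X at every value of k, which is why trying several settings and comparing the resulting models is the usual approach.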
Best Practices for Machine Learning Projects
In this section, we examine best practices that make machine learning projects successful. These are practical tips that most companies and teams end up learning with experience.
Understand the Decision Process
Machine learning-based systems or processes use data to drive business decisions. Hence, it is important to understand the business problem that needs to be solved, independent of technology solutions; in other words, what decision or action needs to be taken that can be informed by data. Being clear about the decision process is critical. This step is also sometimes referred to as mapping a business scenario/problem to a data science question.
For our house-price prediction scenario, the key business decision for a home buyer is “Should I buy a given house at the listed price?” or “What is a good bid price for this house to maximize my chance of winning the bid?” This could be mapped to the data science question: “What is the best estimate of the house price based on past sales data of other houses?”
Table 1-2 shows other real-world business scenarios and what this decision process looks like.
Table 1-2. Understanding a decision process: real-world scenarios
Business scenario | Key decision | Data science question
Predictive maintenance | Should I service this piece of equipment? | What is the probability this equipment will fail within the next x days?
Energy forecasting | Should I buy or sell energy contracts? | What will be the long-/short-term demand for energy in a region?
Customer churn | Which customers should I prioritize to reduce churn? | What is the probability of churn within x days for each customer?
Personalized marketing | What product should I offer first? | What is the probability that customers will purchase each product?
Product feedback | Which service/product needs attention? | What is the social media sentiment for each service/product?
Establish Performance Metrics
As with any project, performance metrics are important to guide any machine learning project toward the proper goals and to ensure progress is made. After we understand the decision process, the next step is to answer these two key questions:
How do we measure progress toward a goal or desired outcome? In other words, how do we define metrics to evaluate progress?
What would be considered a success? That is, how do we define targets for the metrics defined?
For our house-price prediction example, we need a metric to measure how close our predictions are to the actual price. There are quite a few metrics to choose from. One of the most commonly used metrics for regression tasks is root-mean-square error (RMSE). This is defined as the square root of the average squared distance between the actual score and the predicted score, as shown here:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(y_j - \hat{y}_j\right)^2}$$
Here, y_j denotes the true value for the jth data point, and ŷ_j denotes the predicted value. One intuitive way to understand this formula is that it is the Euclidean distance between the vector of the true values and the vector of the predicted values, divided by √n, where n is the number of data points.
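A minimal sketch of the RMSE computation in plain Python follows. The actual prices come from Table 1-1; the predicted prices are made-up values used only to exercise the function.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


actual = [500_000, 800_000, 950_000, 1_200_000]      # prices from Table 1-1
predicted = [510_000, 790_000, 950_000, 1_180_000]   # hypothetical model output

score = rmse(actual, predicted)
print(f"RMSE: ${score:,.0f}")
```

Because the errors are squared before averaging, the single $20,000 miss contributes more to the score than the two $10,000 misses combined.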
Focus on Transparency to Gain Trust
There is a common perception that machine learning is a black box that just works magically. It is critical to understand that although model performance as measured by metrics is important, it is even more important for us to understand how the model works. Without this understanding, it is difficult to trust the model and therefore difficult to convince key stakeholders and customers of the business value of machine learning and machine learning-based systems.
In heavily regulated industries like healthcare and banking, which are required to comply with regulation, interpretability of models is critical. Model interpretability is typically represented by feature importance, which tells you how each input column (or feature) affects the model’s predictions. This allows data scientists to explain resulting predictions so that stakeholders can see which features are most important in the model.
In our house-price prediction scenario, our trust in the model would increase if the model, in addition to price prediction, indicated key input features that contributed to the output; for example, house size and age. Figure 1-5 shows feature importance for our house-price prediction scenario. Notice that age and school rating are the topmost features.
Figure 1-5. Feature importance
Embrace Experimentation
Building a good machine learning model takes time. As with other software projects, the trick to becoming successful in machine learning projects lies in how fast we try out new hypotheses, learn from them, and keep evolving. As shown in Figure 1-6, the path to success isn’t usually easy; it requires a lot of persistence and due diligence, and involves failures along the way.
Figure 1-6. Success is not easy.
Here are key aspects of a culture that values experimentation:
Be willing to learn from experiments (successes or failures).
Share the learning with peers.
Promote successful experiments to production.
Understand that failure is a valid outcome of an experiment.
Quickly move on to the next hypothesis.
Refine the next experiment.
Don’t Operate in a Silo
Customers typically experience machine learning models through applications. Figure 1-7 shows how machine learning systems are different from traditional software systems. The key difference is that machine learning systems, in addition to code workflow, must also consider data workflow.
Figure 1-7. Machine learning system versus traditional systems
After data scientists have built a machine learning model that is satisfactory to them, they hand it off to an app developer who integrates it into the larger application and deploys it. Often, any bugs or performance issues go undiscovered until the application has already been deployed. The resulting friction between app developers and data scientists to identify and fix the root cause can be a slow, frustrating, and expensive process.
As machine learning enters more business-critical applications, it is increasingly clear that data scientists need to collaborate closely with app developers to build and deploy machine learning-powered applications more efficiently. Data scientists are focused on the data science life cycle; namely, data ingestion and preparation, model building, and deployment. They are also interested in periodically retraining and redeploying the model to adjust for freshly labeled data, data drift, user feedback, or changes in model inputs. The app developer is focused on the application life cycle: building, maintaining, and continuously updating the larger business application that the model is part of. Both parties are motivated to make the business application and model work well together to meet end-to-end performance, quality, and reliability goals.
What is needed is a way to bridge the data science and application life cycles more effectively. Figure 1-8 shows how this collaboration could be enabled. We will cover this in more depth later in the book.
Figure 1-8. App developer and data scientist working together
An Iterative and Time-Consuming Process
In this section, we dig deeper into the machine learning process by using our house-price prediction example. We started with house size as the only input, and we saw the relationship between house size and house price to be linear. To create a good model that can predict prices more accurately, we need to explore good input features, select the best algorithm, and tune hyperparameter values. But how do you know which features are good, and which algorithm and hyperparameter values will do the best? There is no silver bullet here; we will need to try out different combinations of features, algorithms, and hyperparameter values. Let’s take a look at each of these three steps and then see how they apply to our house-price prediction problem.
Feature Engineering
Feature engineering is the process of using our knowledge of the data to create features that make machine learning algorithms work. As shown in Figure 1-9, this involves four steps.
Figure 1-9. Feature engineering
First, we acquire data: collect the data with all of the possible input variables/features and get it to a usable state. Most real-world datasets are not clean and need work to bring the data to an acceptable level of quality before using it. This can involve things such as fixing missing values, removing anomalies and possibly incorrect data, and ensuring the data distribution is representative.
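The following sketch shows one simple way to handle two of these cleanup tasks in plain Python: filling missing values with the median and dropping implausible entries. The raw values, the 200 sq. ft plausibility threshold, and the function name are assumptions made for illustration only; real projects choose these rules from knowledge of the data.

```python
import statistics

# Hypothetical raw house sizes: None marks a missing value, and 25 sq. ft
# is assumed to be a data-entry error.
raw_sizes = [1300, None, 2500, 3200, 25, 2000]


def clean_sizes(values, min_plausible=200):
    """Impute missing sizes with the median and drop implausible ones."""
    plausible = [v for v in values if v is not None and v >= min_plausible]
    median = statistics.median(plausible)
    cleaned = []
    for v in values:
        if v is None:
            cleaned.append(median)   # fix missing value
        elif v < min_plausible:
            continue                 # remove anomaly
        else:
            cleaned.append(v)
    return cleaned


cleaned_sizes = clean_sizes(raw_sizes)
print(cleaned_sizes)
```

The median is computed only from plausible values so that the anomaly does not distort the imputed number.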
Next you’ll need to generate features: explore generating more features from available data. This is typically useful when dealing with text data or time-series data. Text-related features could be as simple as n-grams and count vectorization or as advanced as sentiment from review text. Similarly, time-related features could be as simple as month and week-index-of-year or as complex as time-based aggregations. These additional features can prove helpful in improving the accuracy of the model.
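As a small illustration of the simpler end of that range, this sketch derives n-gram counts from a short text and month and week-of-year from a date, using only the Python standard library. The review text and helper names are made up for the example; the date is simply one sample value.

```python
from collections import Counter
from datetime import date


def ngrams(text, n):
    """Split text on whitespace and return its n-grams as tuples."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def date_features(d):
    """Simple time-related features: month and ISO week index of the year."""
    return {"month": d.month, "week_of_year": d.isocalendar()[1]}


review = "great house great view"          # hypothetical review text
unigram_counts = Counter(ngrams(review, 1))  # count vectorization on 1-grams
bigrams = ngrams(review, 2)
sale_features = date_features(date(2019, 9, 20))

print(unigram_counts)
print(bigrams)
print(sale_features)
```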
With this complete, you’ll need to transform the data to make it suitable for machine learning. Often, machine learning algorithms require that data be prepared in specific ways before fitting a machine learning model. For instance, many such algorithms cannot operate on categorical data directly, and require that all input variables and output variables be numeric. A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values. Examples of these variables include color (red, blue, green, etc.), country (United States, India, China, etc.), and blood group (A, B, O, AB). Categorical variables must be converted to a numerical form, which is typically done by using integer encoding or one-hot encoding techniques.
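The two encodings mentioned above can be sketched in a few lines of plain Python. The function names and the choice to order categories alphabetically are assumptions for this example; libraries may assign codes differently.

```python
def integer_encode(values):
    """Map each category to an integer code (alphabetical order assumed)."""
    categories = sorted(set(values))
    index = {category: code for code, category in enumerate(categories)}
    return [index[v] for v in values], categories


def one_hot_encode(values):
    """Represent each value as a 0/1 vector with a single 1 for its category."""
    codes, categories = integer_encode(values)
    rows = [[1 if code == i else 0 for i in range(len(categories))]
            for code in codes]
    return rows, categories


colors = ["red", "blue", "green", "blue"]
codes, categories = integer_encode(colors)
one_hot, _ = one_hot_encode(colors)

print(categories)  # column order used by both encodings
print(codes)
print(one_hot)
```

Integer encoding implies an ordering among categories, which can mislead some algorithms; one-hot encoding avoids that at the cost of one column per category.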
The final step is feature selection: choosing a subset of features to train the model on. Why is this necessary? Why not train the model with the full set of features? Feature selection identifies and removes the unneeded, irrelevant, and redundant attributes from data that don’t contribute to, or can in fact decrease, the model’s accuracy. The objective of feature selection is threefold:
Improve model accuracy
Improve model training time/cost
Provide a better understanding of the underlying process of feature generation
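One simple filter-style approach, sketched here under stated assumptions, drops features that are constant and features that are nearly perfectly correlated with one already kept. The 0.95 threshold, the column names, and the year-built values are hypothetical; only the sizes come from Table 1-1. This is one of many possible selection techniques, not a method prescribed by the book.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def select_features(columns, max_corr=0.95):
    """Keep features that are not constant and not redundant with kept ones."""
    selected = []
    for name, values in columns.items():
        if len(set(values)) == 1:
            continue  # irrelevant: a constant column carries no information
        if any(abs(pearson(values, columns[kept])) > max_corr
               for kept in selected):
            continue  # redundant: nearly duplicates a feature already kept
        selected.append(name)
    return selected


sizes_sqft = [1300, 2000, 2500, 3200]
columns = {
    "size_sqft": sizes_sqft,
    "size_sqm": [s * 0.092903 for s in sizes_sqft],  # same information, new unit
    "country_code": [1, 1, 1, 1],                    # constant column
    "year_built": [1995, 1978, 2005, 1988],          # hypothetical values
}

selected = select_features(columns)
print(selected)
```

Here the square-meter column and the constant column are removed, leaving fewer features to train on without losing information.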
NOTE
Feature engineering steps are critical for traditional machine learning but not so much for deep learning, because features are automatically generated/inferred through the deep learning network.
We began with a single feature: house size. But we know that the price of a house is dependent not only on size, but also on other characteristics. What other input features could influence house price? Although size might be one of the most important inputs, here are a few more worth considering:
Zip code
Year built
Lot size
Schools
Number of bedrooms
Number of bathrooms
Number of garage stalls
Amenities
Algorithm Selection
After we have chosen a good set of features, the next step is to determine the correct algorithm for the model. For the data we have, a simple linear regression model might seem to work. But remember that we have only a few data points (four houses with prices): too few to be representative and too few for machine learning. Also, linear regression assumes a linear relation between input features and target variable. As we collect more data points, linear regression might not remain the most relevant choice, and we will be motivated to explore other techniques (algorithms) depending on trends and patterns in the data.
Hyperparameter Tuning
As discussed earlier in this chapter, hyperparameters play a key role in model accuracy and training performance. Hence, tuning them is a critical step in getting to a good model. Because different algorithms have different sets of hyperparameters, this step of tuning hyperparameters adds to the complexity of the end-to-end process.
The End-to-End Process
With that basic understanding of feature engineering, algorithm selection, and hyperparameter tuning, let’s go step by step through our house-price prediction problem.
Let’s begin with the Size, Lot size, and Year built features and Gradient Boosted trees with specific hyperparameter values, as shown in Figure 1-10. The resulting model is 30% accurate. But we want to do better than that.
Figure 1-10. Machine learning process: step 1
To improve on this, we first try different hyperparameter values for the same set of features and algorithm. If that doesn’t improve the accuracy of the model to a satisfactory level, we try different algorithms, and if that doesn’t help either, we add more features. Figure 1-11 shows one such intermediate state, with School added as a feature and the k-nearest neighbors (KNN) algorithm used. The resulting model is 50% accurate but still not good enough, so we continue this process and try different combinations.
Figure 1-11. Machine learning process: intermediate state
After multiple iterations of trying out different combinations of features, algorithms, and hyperparameter values, we end up with a model that meets our criteria, as shown in Figure 1-12.
Figure 1-12. Machine learning process: best model
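The loop described above can be sketched in miniature. The code below is an illustrative toy, not the book's procedure: it uses only the Size feature and the four houses from Table 1-1, compares linear regression against k-nearest neighbors with three values of k, and scores each candidate with leave-one-out RMSE. All function names are assumptions, and four data points are far too few for the ranking to mean much.

```python
import math

# Table 1-1 data as (size in sq. ft, price in dollars).
houses = [(1300, 500_000), (2000, 800_000), (2500, 950_000), (3200, 1_200_000)]


def fit_linear(train):
    """Fit a least-squares line and return a prediction function."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
             / sum((x - mean_x) ** 2 for x, _ in train))
    intercept = mean_y - slope * mean_x
    return lambda size: intercept + slope * size


def fit_knn(k):
    """Return a training function for k-nearest neighbors with a fixed k."""
    def fit(train):
        def predict(size):
            nearest = sorted(train, key=lambda h: abs(h[0] - size))[:k]
            return sum(price for _, price in nearest) / k
        return predict
    return fit


def leave_one_out_rmse(fit, data):
    """Train on all houses but one, predict the held-out house, and repeat."""
    squared_errors = []
    for i, (size, price) in enumerate(data):
        model = fit(data[:i] + data[i + 1:])
        squared_errors.append((price - model(size)) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Candidate (algorithm, hyperparameter) combinations to try.
candidates = {"linear regression": fit_linear}
for k in (1, 2, 3):
    candidates[f"KNN (k={k})"] = fit_knn(k)

scores = {name: leave_one_out_rmse(fit, houses)
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE ${score:,.0f}")
print("Best candidate:", best)
```

Even this toy search needs a separate training run for every candidate and every held-out house, which hints at how quickly the work grows once more features, algorithms, and hyperparameters are added.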
As you can see, this is an iterative and time-consuming process. To put this in perspective: if there are 10 features, there are a total of 2^10 (1,024) ways to select features. If we try five algorithms, and assume that each has an average of five hyperparameter combinations to try, we are looking at a total of 1,024 × 5 × 5 = 25,600 iterations!
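The arithmetic can be checked with a few lines of Python. Enumerating the subsets explicitly, rather than just computing 2 ** 10, makes it visible where the 1,024 comes from; the variable names are illustrative.

```python
from itertools import combinations

n_features = 10

# Count every subset of the 10 features, from size 0 up to size 10.
# (The count of 1,024 includes the empty subset.)
feature_subsets = sum(
    1
    for subset_size in range(n_features + 1)
    for _ in combinations(range(n_features), subset_size)
)

n_algorithms = 5
settings_per_algorithm = 5  # hyperparameter combinations tried per algorithm

total_runs = feature_subsets * n_algorithms * settings_per_algorithm
print(feature_subsets, total_runs)
```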
Figure 1-13 shows the scikit-learn cheat sheet demonstrating that choosing the proper algorithm could be a complex problem in itself. Now imagine adding feature engineering and hyperparameter tuning on top of it. As a result, it takes data scientists anywhere from a couple of weeks to months to arrive at a good model.
Figure 1-13. Scikit-learn algorithm cheat sheet (source: https://oreil.ly/xUZbU)
Growing Demand
Despite the complexity of the model-building process, demand for machine learning has skyrocketed. Most organizations across all industries are trying to use data and machine learning to gain a competitive advantage: infusing intelligence into their products and processes to delight customers and amplify business impact. Figure 1-14 shows the variety of real-world business problems being solved using machine learning.
Figure 1-14. Real-world business problems using machine learning
As a result, there is huge demand for machine learning-related jobs. Figure 1-15 shows the percentage growth in various job postings from 2015 to 2018.
Figure 1-15. Growth in machine learning-related jobs
And Figure 1-16 shows the expected revenue from enterprise applications using machine learning and artificial intelligence growing astronomically.
Figure 1-16. Machine learning/artificial intelligence revenue projections
Conclusion
In this chapter, you learned some of the best practices that successful machine learning projects have in common. We discussed that the process of building a good machine learning model is iterative and time-consuming, resulting in data scientists requiring anywhere from a couple of weeks to months to build a good model. At the same time, demand for machine learning is growing rapidly and is expected to skyrocket.
To balance this supply-versus-demand problem, there needs to be a better way to shorten the time it takes to build machine learning models. Can some of the steps in that workflow be automated? Absolutely! Automated Machine Learning is one of the most important skills that successful data scientists need to have in their toolbox for improved productivity.
In the following chapters we’ll go deeper into Automated Machine Learning. We will explore what it is, how to get started, and how it is being used in real-world applications today.