Practical Automated Machine Learning on Azure

Using Azure Machine Learning to Quickly Build AI Solutions

Deepak Mukunthu, Parashar Shah, and Wee Hyong Tok

Practical Automated Machine Learning on Azure by Deepak Mukunthu, Parashar Shah, and Wee Hyong Tok

Copyright © 2019 Deepak Mukunthu, Parashar Shah, and Wee Hyong Tok. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: [email protected].

Acquisitions Editor: Jonathan Hassell

Development Editor: Nicole Taché

Production Editor: Deborah Baker

Copyeditor: Octal Publishing, LLC

Proofreader: Sharon Wilkey

Indexer: Judith McConville

Interior Designer: David Futato

Cover Designer: Karen Montgomery

Illustrator: Rebecca Demarest

September 2019: First Edition

Revision History for the First Edition

2019-09-20: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492055594 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Practical Automated Machine Learning on Azure, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-492-05559-4

[LSI]

Dedications

Dedicated to my wife, kids, and parents for their unconditional love, encouragement, and support in everything I do. - Deepak

Dedicated to the wonderful individuals in my life: Juliet, Nathaniel, and Jayden. My gratitude and love for them is infinite. - Wee Hyong

I would like to thank my parents Nita and Mahendra and my sister Vidhi for their unconditional love and encouragement throughout my life. I am thankful to my buddies at Microsoft (Priya, Premal, Vicky, Martha, Savita, Deepti, and Sagar) and my buddies outside of Microsoft (Kevin, Ritu, Dhaval, Shamit, Priyadarshan, Pradip, and Nikhil) for their loving friendship. - Parashar

Foreword

I vividly remember my first undergraduate class in artificial intelligence (AI). My father had worked for years on “expert systems,” and I was at MIT to learn from the best how to perform this wizardry. Marvin Minsky, one of the founders of the field, even taught a series of guest lectures there. It was about midway through the semester when the great disillusionment hit me: “It’s all just a bunch of tricks!” There was no “intelligence” to be found; just a bunch of brittle rules engines and clever use of math. This was in the early ’90s and the start of my own personal AI winter, when I dismissed AI as not having much use.

Years later, while I was working on advertising systems, I finally saw that there was power in this “bunch of tricks.” Algorithms that had been hand-tuned for months by talented engineers were being beaten by simple models provided with lots of data. I saw that the explosion that was to come simply needed more data and more computation to be effective. Over the past 5 to 10 years, the explosion in both big data and computation power has unleashed an industry that has had lots of starts and stops to it.

This time is different. While the hype about AI is still tremendously high, the potential applications of practical AI have really just begun to hit the business world. The rules or people making predictions today will be replaced virtually every place by AI algorithms. The value AI creates for businesses is tremendous, from being better able to value the oil available in an oil field to better predicting the inventory a store should stock of each new sneaker. Even marginal improvements in these capabilities represent billions of dollars of value across businesses.

We’re now in an age of AI implementation. Companies are working to find all the best places to deploy AI in their enterprises. One of the biggest challenges is matching the hype to reality. Half the companies I’ve talked to expect AI to perform some kind of magic for problems they have no idea how to solve. The other half are underestimating the power that AI can have. What they need are people with enough background in AI to help them conceive of what is possible and apply it to their business problems.

Customers I talk to are struggling to find enough people with those skills. While they have lots of developers and data analysts who are skilled and comfortable making predictions and decisions with data, they need data scientists who can then build the model from that data. This book will help fill that gap.

It shows how automated ML can empower developers and data analysts to train AI models. It highlights a number of business cases where AI is a great fit to the business problem and shows exactly how to build that model and put it into production. The technology and ideas in this book have been pressure-tested at scale with teams all across Microsoft, including Bing, Office, Azure Security, internal IT, and many more. It’s also been used by many external businesses using Azure Machine Learning.

Eric Boyd

Microsoft Corporate Vice President, Azure AI

September 2019

Preface

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

TIP

This element signifies a tip or suggestion.

NOTE

This element signifies a general note.

WARNING

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://oreil.ly/Practical_Automated_ML_on_Azure.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Practical Automated Machine Learning on Azure by Deepak Mukunthu, Parashar Shah, and Wee Hyong Tok (O’Reilly). Copyright 2019 Deepak Mukunthu, Parashar Shah, and Wee Hyong Tok, 978-1-492-05559-4.”

If you feel your use of code examples falls outside fair use or the permission given above, [email protected].

O’Reilly Online Learning

NOTE

For almost 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.

1005 Gravenstein Highway North

Sebastopol, CA 95472

800-998-9938 (in the United States or Canada)

707-829-0515 (international or local)

707-829-0104 (fax)

The web page for this book lists errata, examples, and additional information. You can access this page at http://www.oreilly.com/catalog/9781492055594.

To comment or ask technical questions about this book, [email protected].

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

This book wouldn’t have been possible without great contributions from these folks. Thank you!

We are thankful to our coworkers at Microsoft (Azure AI product, marketing, and many other teams) for working together to deliver the best enterprise-ready Azure Machine Learning service.

Nicolo Fusi, for sharing details on research that led to the creation of Automated ML (Chapter 2).

Sharon Gillett, for text inputs to the Automated ML introduction (Chapter 2).

Vanessa Milan, for images for the Automated ML introduction (Chapter 2).

Akchara Mukunthu, for example scenarios for Machine Learning task detection (Table 2-1 in Chapter 2).

Krishna Anumalasetty and Thomas Abraham, for technical review of the book.

Jen Stirrup, for feedback on the book.

The amazing O’Reilly team (Nicole Taché, Deborah Baker, Bob Russell, Jonathan Hassell, Ben Lorica, and many more), for working with us from concept to production and giving us the opportunity to write and share the book with the community.

Members of the Azure Machine Learning and Azure CAT team, for the supportive environment that enabled the authors to write the book during their off hours, and many weekends and holidays.

Part I. Automated Machine Learning

In this part, you will learn how Automated Machine Learning can help automate model development.

Chapter 1. Machine Learning: Overview and Best Practices

How are humans different from machines? There are quite a few differences, but here’s an important one: humans learn from experience, whereas machines follow instructions given to them. What if machines can also learn from experience? That is the crux of machine learning. For machines, “data from the past” is the logical equivalent of “experience.” Machine learning combines statistics and computer science to enable machines to learn how to perform a given task without being explicitly programmed to do so via instructions.

Machine learning is widely used today, and we interact with it every day. Here are a few examples to illustrate:

Search engines like Bing or Google

Product recommendations at online stores like Amazon or eBay

Personalized video recommendations at Netflix or YouTube

Voice-based digital assistants like Alexa or Cortana

Spam filters for our email inbox

Credit card fraud detection

Why is machine learning as a trend emerging so fast? Why is everyone so interested in it now? As shown in Figure 1-1, its popularity arises from three key trends: big data, better/cheaper compute, and smarter algorithms.

Figure 1-1. Machine learning growth

In this chapter, we provide a quick refresher on machine learning by using a real-world example, discuss some of the best practices that differentiate successful machine learning projects from the rest, and end with challenges around productivity and scale.

Machine Learning: A Quick Refresher

What does the process of building a machine learning model look like? Let’s dig deeper using a real scenario: house price prediction. We have past home sales data, and the task is to predict the sale price for a given house that just came onto the market and isn’t currently in our dataset.

For simplicity, let’s assume that the size of the house (in square feet) is the most important input attribute (or feature) that determines house value. As shown in Table 1-1, we have data from four houses, A, B, C, and D, and we need to predict the price of house X.

Table 1-1. House prices based on size

House | Size (sq. ft) | Price ($)
A | 1300 | 500,000
B | 2000 | 800,000
C | 2500 | 950,000
D | 3200 | 1,200,000
X | 1800 | ?

We begin by plotting Size on the x-axis and Price on the y-axis, as shown in Figure 1-2.

Figure 1-2. Plotting price versus size

What’s the best estimate for the price of house X?

$550,000

$700,000

$1,000,000

Let’s figure it out. As shown in Figure 1-3, the four points that we plotted based on the data form an almost straight line. If we draw the line that best fits our data, we can find the point on the line associated with house X on the x-axis and the corresponding point on the y-axis, which will give us our price estimate.

Figure 1-3. Creating a straight line to find price estimate

In this case, that straight line represents our model and demonstrates a linear relationship. Linear regression is a statistical approach for modeling a linear relationship between input variables (also called feature, or independent, variables) and an output variable (also called a target, or dependent, variable). Mathematically, this linear relationship can be represented as follows:

$$ y = \beta_0 + \beta_1 x $$

where:

y is the output variable; for example, the house price.

x is the input variable; for example, size in square feet.

β0 is the intercept (the value of y when x = 0).

β1 is the coefficient for x and the slope of the regression line (the average increase in y associated with a one-unit increase in x).

Model Parameters

β0 and β1 are known as the model parameters of this linear regression model.

When implementing linear regression, the algorithm finds the line of best fit by choosing the model parameters β0 and β1 such that the line is as close as possible to the actual data points (minimizing the sum of the squared distances between each actual data point and the line representing model predictions).

Figure 1-4 shows this conceptually. Dots represent actual data points, and the line represents the model predictions. d1 to d9 represent distances between data points and the corresponding model prediction, and D is the sum of their squares. The line shown in the figure is the best-fit regression line that minimizes D.

Figure 1-4. Regression
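As a minimal sketch of the least-squares idea, the following Python fits β0 and β1 to the four houses in Table 1-1 with the closed-form formulas for a single input, estimates the price of house X, and computes D. The function names (`fit_line`, `sum_squared_distances`) are illustrative assumptions, not code from the book.

```python
# Sizes (sq. ft) and prices ($) for houses A-D from Table 1-1.
sizes = [1300, 2000, 2500, 3200]
prices = [500_000, 800_000, 950_000, 1_200_000]


def fit_line(xs, ys):
    """Return (beta0, beta1) for the least-squares line y = beta0 + beta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means.
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


def sum_squared_distances(xs, ys, b0, b1):
    """D: sum of squared vertical distances between the data and the line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


beta0, beta1 = fit_line(sizes, prices)
price_x = beta0 + beta1 * 1800  # house X from Table 1-1

d_best = sum_squared_distances(sizes, prices, beta0, beta1)
# Any other line, here one with a 5% steeper slope, gives a larger D.
d_other = sum_squared_distances(sizes, prices, beta0, beta1 * 1.05)

print(f"beta0 = {beta0:,.0f}, beta1 = {beta1:.2f}")
print(f"estimated price of house X = ${price_x:,.0f}")
print(f"D (best fit) = {d_best:,.0f}, D (steeper line) = {d_other:,.0f}")
```

With these four points the estimate for house X comes out just under $700,000, which matches the middle option offered earlier.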

As you can see, model parameters are an integral part of the model and determine the outcome. Their values are learned from data through the model training process.

Hyperparameters

There is another set of parameters known as hyperparameters. Model hyperparameters are used during the model training process to establish the correct values of model parameters. They are external to the model, and their values cannot be estimated from data. The choice of the hyperparameters will affect the duration of the training and the accuracy of the predictions. As part of the model training process, data scientists usually specify hyperparameters based on heuristics or knowledge, and often tune the hyperparameters manually. Hyperparameter tuning relies more on experimental results than theory, and thus the best method to determine the optimal settings is to try many combinations and evaluate the performance of each model.

Simple linear regression doesn’t have any hyperparameters. But variants of linear regression, like Ridge regression and Lasso (http://bit.ly/lasso-proj), do. Here are some examples of model hyperparameters for various machine learning algorithms:

The k in k-nearest neighbors

The desired depth and number of leaves in a decision tree

The C and sigma in support vector machines (SVMs)

The learning rate for training a neural network
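To make "try many combinations and evaluate each model" concrete, here is a small sketch that tunes one hyperparameter, the k in k-nearest neighbors, for a one-feature price model. The house data, the candidate values of k, and the helper names are all invented for illustration; leave-one-out evaluation is an assumption chosen because the dataset is tiny.

```python
import math

# Hypothetical (size in sq. ft, price in $) pairs, invented for illustration only.
houses = [
    (1300, 500_000), (1500, 560_000), (1800, 700_000), (2000, 800_000),
    (2300, 880_000), (2500, 950_000), (2900, 1_100_000), (3200, 1_200_000),
]


def knn_predict(train, size, k):
    """Average the prices of the k training houses closest in size."""
    nearest = sorted(train, key=lambda house: abs(house[0] - size))[:k]
    return sum(price for _, price in nearest) / k


def leave_one_out_rmse(data, k):
    """Hold out each house in turn, predict it from the rest, return the RMSE."""
    squared_errors = []
    for i, (size, price) in enumerate(data):
        train = data[:i] + data[i + 1:]
        squared_errors.append((price - knn_predict(train, size, k)) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# k is a hyperparameter: it is set before training rather than learned from data.
scores = {k: leave_one_out_rmse(houses, k) for k in (1, 2, 3, 5)}
best_k = min(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: leave-one-out RMSE = {score:,.0f}")
print("best k:", best_k)
```

The same loop structure applies to any hyperparameter: pick candidate values, train and evaluate with each, and keep the setting with the best score.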

Best Practices for Machine Learning Projects

In this section, we examine best practices that make machine learning projects successful. These are practical tips that most companies and teams end up learning with experience.

Understand the Decision Process

Machine learning–based systems or processes use data to drive business decisions. Hence, it is important to understand the business problem that needs to be solved, independent of technology solutions; in other words, what decision or action needs to be taken that can be informed by data. Being clear about the decision process is critical. This step is also sometimes referred to as mapping a business scenario/problem to a data science question.

For our house-price prediction scenario, the key business decision for a home buyer is “Should I buy a given house at the listed price?” or “What is a good bid price for this house to maximize my chance of winning the bid?” This could be mapped to the data science question: “What is the best estimate of the house price based on past sales data of other houses?”

Table 1-2 shows other real-world business scenarios and what this decision process looks like.

Table 1-2. Understanding a decision process: real-world scenarios

Business scenario | Key decision | Data science question
Predictive maintenance | Should I service this piece of equipment? | What is the probability this equipment will fail within the next x days?
Energy forecasting | Should I buy or sell energy contracts? | What will be the long-/short-term demand for energy in a region?
Customer churn | Which customers should I prioritize to reduce churn? | What is the probability of churn within x days for each customer?
Personalized marketing | What product should I offer first? | What is the probability that customers will purchase each product?
Product feedback | Which service/product needs attention? | What is the social media sentiment for each service/product?

Establish Performance Metrics

As with any project, performance metrics are important to guide any machine learning project toward the proper goals and to ensure progress is made. After we understand the decision process, the next step is to answer these two key questions:

How do we measure progress toward a goal or desired outcome? In other words, how do we define metrics to evaluate progress?

What would be considered a success? That is, how do we define targets for the metrics defined?

For our house-price prediction example, we need a metric to measure how close our predictions are to the actual price. There are quite a few metrics to choose from. One of the most commonly used metrics for regression tasks is root-mean-square error (RMSE). This is defined as the square root of the average squared distance between the actual value and the predicted value, as shown here:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} $$

Here, $y_i$ denotes the true value for the ith data point, and $\hat{y}_i$ denotes the predicted value. One intuitive way to understand this formula is that it is the Euclidean distance between the vector of the true values and the vector of the predicted values, divided by $\sqrt{n}$, where n is the number of data points.
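The RMSE definition translates directly into a few lines of Python. In this sketch the actual prices come from Table 1-1, while the predicted prices are hypothetical values chosen only to exercise the formula; the function name `rmse` is likewise an assumption.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    n = len(actual)
    return math.sqrt(sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n)


# Actual prices from Table 1-1; the predictions are hypothetical.
actual = [500_000, 800_000, 950_000, 1_200_000]
predicted = [520_000, 770_000, 950_000, 1_210_000]

error = rmse(actual, predicted)

# Equivalent view: Euclidean distance between the vectors, divided by sqrt(n).
euclidean = math.dist(actual, predicted)
scaled = euclidean / math.sqrt(len(actual))

print(f"RMSE = {error:,.2f}")
print(f"Euclidean distance / sqrt(n) = {scaled:,.2f}")
```

Because RMSE is expressed in the same unit as the target (dollars here), it can be compared directly against a business target such as "within $20,000 on average."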

Focus on Transparency to Gain Trust

There is a common perception that machine learning is a black box that just works magically. It is critical to understand that although model performance as measured by metrics is important, it is even more important for us to understand how the model works. Without this understanding, it is difficult to trust the model and therefore difficult to convince key stakeholders and customers of the business value of machine learning and machine learning–based systems.

In heavily regulated industries like healthcare and banking, interpretability of models is critical for compliance. Model interpretability is typically represented by feature importance, which tells you how each input column (or feature) affects the model’s predictions. This allows data scientists to explain resulting predictions so that stakeholders can see which inputs matter most to the model.

In our house-price prediction scenario, our trust in the model would increase if the model, in addition to the price prediction, indicated the key input features that contributed to the output; for example, house size and age. Figure 1-5 shows feature importance for our house-price prediction scenario. Notice that age and school rating are the topmost features.

Figure 1-5. Feature importance
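One common way to estimate feature importance is permutation importance: scramble one input column and measure how much the model's error grows. The sketch below illustrates the idea only; the book does not say which technique produced Figure 1-5, and the data, the stand-in `model` function, and its coefficients are all hypothetical. A fixed rotation replaces random shuffling so the result is reproducible.

```python
import math

# Hypothetical houses: (size in sq. ft, age in years, id_number). Invented data.
rows = [
    (1300, 40, 7), (2000, 25, 3), (2500, 10, 9), (3200, 5, 1), (1800, 30, 4),
]
feature_names = ["size", "age", "id_number"]


def model(size, age, id_number):
    """Stand-in for a trained model; it deliberately ignores id_number."""
    return 50_000 + 350 * size - 2_000 * age


# Targets are generated by the model itself, so the baseline error is zero.
targets = [model(*row) for row in rows]


def rmse(data):
    """RMSE of the model's predictions on data against the fixed targets."""
    return math.sqrt(sum((model(*r) - t) ** 2 for r, t in zip(data, targets)) / len(data))


def importance(column):
    """Increase in RMSE after permuting one feature column."""
    values = [row[column] for row in rows]
    rotated = values[1:] + values[:1]  # fixed permutation keeps the sketch deterministic
    permuted = [row[:column] + (v,) + row[column + 1:] for row, v in zip(rows, rotated)]
    return rmse(permuted) - rmse(rows)


scores = {name: importance(i) for i, name in enumerate(feature_names)}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:,.0f}")
```

A feature the model never uses (here `id_number`) scores zero, while features the predictions depend on score higher the more the error grows when they are scrambled.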

Embrace Experimentation

Building a good machine learning model takes time. As with other software projects, the trick to becoming successful in machine learning projects lies in how fast we try out new hypotheses, learn from them, and keep evolving. As shown in Figure 1-6, the path to success isn’t usually easy and requires a lot of persistence, due diligence, and failures on the way.

Figure 1-6. Success is not easy.

Here are key aspects of a culture that values experimentation:

Be willing to learn from experiments (successes or failures).

Share the learning with peers.

Promote successful experiments to production.

Understand that failure is a valid outcome of an experiment.

Quickly move on to the next hypothesis.

Refine the next experiment.

Don’t Operate in a Silo

Customers typically experience machine learning models through applications. Figure 1-7 shows how machine learning systems are different from traditional software systems. The key difference is that machine learning systems, in addition to code workflow, must also consider data workflow.

Figure 1-7. Machine learning system versus traditional systems

After data scientists have built a machine learning model that is satisfactory to them, they hand it off to an app developer who integrates it into the larger application and deploys it. Often, any bugs or performance issues go undiscovered until the application has already been deployed. Identifying and fixing the root cause then creates friction between app developers and data scientists and can be a slow, frustrating, and expensive process.

As machine learning enters more business-critical applications, it is increasingly clear that data scientists need to collaborate closely with app developers to build and deploy machine learning–powered applications more efficiently. Data scientists are focused on the data science life cycle; namely, data ingestion and preparation, model building, and deployment. They are also interested in periodically retraining and redeploying the model to adjust for freshly labeled data, data drift, user feedback, or changes in model inputs. The app developer is focused on the application life cycle: building, maintaining, and continuously updating the larger business application that the model is part of. Both parties are motivated to make the business application and model work well together to meet end-to-end performance, quality, and reliability goals.

What is needed is a way to bridge the data science and application life cycles more effectively. Figure 1-8 shows how this collaboration could be enabled. We will cover this in more depth later in the book.

Figure 1-8. App developer and data scientist working together

An Iterative and Time-Consuming Process

In this section, we dig deeper into the machine learning process by using our house-price prediction example. We started with house size as the only input, and we saw the relationship between house size and house price to be linear. To create a good model that can predict prices more accurately, we need to explore good input features, select the best algorithm, and tune hyperparameter values. But how do you know which features are good, and which algorithm and hyperparameter values will do the best? There is no silver bullet here; we will need to try out different combinations of features, algorithms, and hyperparameter values. Let’s take a look at each of these three steps and then see how they apply to our house-price prediction problem.

Feature Engineering

Feature engineering is the process of using our knowledge of the data to create features that make machine learning algorithms work. As shown in Figure 1-9, this involves four steps.

Figure 1-9. Feature engineering

First, we acquire data: we collect the data with all of the possible input variables/features and get it to a usable state. Most real-world datasets are not clean and need work to reach an acceptable level of quality before they can be used. This can involve things such as fixing missing values, removing anomalies and possibly incorrect data, and ensuring the data distribution is representative.
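As one small illustration of fixing missing values, the sketch below fills a missing house size with the median of the known sizes. Median imputation is just one of several possible strategies, chosen here as an assumption; the records are hypothetical.

```python
from statistics import median

# Hypothetical raw records; None marks a missing size. Invented for illustration.
raw_sizes = [1300, None, 2500, 3200, 2000]

known = [size for size in raw_sizes if size is not None]
fill_value = median(known)  # the median is less sensitive to outliers than the mean

clean_sizes = [fill_value if size is None else size for size in raw_sizes]
print(clean_sizes)
```

Whether to impute, drop the record, or flag it with an extra indicator column depends on how much data is missing and why.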

Next, you’ll need to generate features: explore generating more features from the available data. This is typically useful when dealing with text data or time-series data. Text-related features could be as simple as n-grams and count vectorization or as advanced as sentiment from review text. Similarly, time-related features could be as simple as month and week-index-of-year or as complex as time-based aggregations. These additional generated features can prove helpful in improving the accuracy of the model.
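The simpler end of that spectrum can be sketched with the Python standard library alone: counting word n-grams in a piece of text and deriving month and week-of-year from a date. The function names, the sample review text, and the choice of ISO week numbering are assumptions made for this example.

```python
from collections import Counter
from datetime import date


def word_ngrams(text, n):
    """Count the word n-grams in a piece of text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def date_features(day):
    """Derive simple calendar features from a date (ISO week numbering)."""
    return {"month": day.month, "week_of_year": day.isocalendar()[1]}


bigrams = word_ngrams("great house with great view", 2)
features = date_features(date(2019, 9, 20))

print(bigrams)
print(features)
```

Each distinct n-gram or calendar field becomes a candidate input column for the model.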

With this complete, you’ll need to transform the data to make it suitable for machine learning. Often, machine learning algorithms require that data be prepared in specific ways before fitting a machine learning model. For instance, many such algorithms cannot operate on categorical data directly, and require all input variables and output variables to be numeric. A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values. Examples of these variables include color (red, blue, green, etc.), country (United States, India, China, etc.), and blood group (A, B, O, AB). Categorical variables must be converted to a numerical form, which is typically done by using integer encoding or one-hot encoding techniques.
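The two encodings can be sketched in plain Python as follows. The sample values reuse the color example from the text; the variable names and the choice to sort categories alphabetically are assumptions made for illustration.

```python
colors = ["red", "blue", "green", "blue"]

# Integer encoding: map each category to an index (sorted for a stable order).
categories = sorted(set(colors))  # ['blue', 'green', 'red']
index = {category: i for i, category in enumerate(categories)}
integer_encoded = [index[color] for color in colors]

# One-hot encoding: one 0/1 column per category, exactly one 1 per row.
one_hot_encoded = [
    [1 if color == category else 0 for category in categories]
    for color in colors
]

print(integer_encoded)
print(one_hot_encoded)
```

Integer encoding implies an ordering between categories (blue < green < red here), which is usually unintended for values like color; one-hot encoding avoids that at the cost of one column per category.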

The final step is feature selection: choosing a subset of features to train the model on. Why is this necessary? Why not train the model with the full set of features? Feature selection identifies and removes the unneeded, irrelevant, and redundant attributes from data that don’t contribute to, or can in fact decrease, the model’s accuracy. The objective of feature selection is threefold:

Improve model accuracy

Improve model training time/cost

Provide a better understanding of the underlying process of feature generation
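One simple family of feature selection techniques scores each candidate feature against the target and keeps those above a cutoff. The sketch below uses the Pearson correlation for that score; this is only one possible approach, and the `door_color_code` feature and the 0.5 threshold are invented for illustration. Sizes and prices come from Table 1-1.

```python
import math

# Prices (in $1,000s) from Table 1-1, plus one real and one invented feature.
prices = [500, 800, 950, 1200]
candidates = {
    "size_sqft": [1300, 2000, 2500, 3200],  # from Table 1-1
    "door_color_code": [1, 2, 2, 1],        # hypothetical, unrelated to price
}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


THRESHOLD = 0.5  # hypothetical cutoff; a real project would tune or justify it

correlations = {name: pearson(values, prices) for name, values in candidates.items()}
selected = [name for name, r in correlations.items() if abs(r) >= THRESHOLD]

for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
print("selected:", selected)
```

Correlation only captures linear, one-feature-at-a-time relationships, so it can miss features that matter in combination; it is shown here because it is easy to inspect.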

NOTE

Feature engineering steps are critical for traditional machine learning but not so much for deep learning, because features are automatically generated/inferred through the deep learning network.

We began with a single feature: house size. But we know that the price of a house is dependent not only on size, but also on other characteristics. What other input features could influence house price? Although size might be one of the most important inputs, here are a few more worth considering:

Zip code

Year built

Lot size

Schools

Number of bedrooms

Number of bathrooms

Number of garage stalls

Amenities

Algorithm Selection

After we have chosen a good set of features, the next step is to determine the correct algorithm for the model. For the data we have, a simple linear regression model might seem to work. But remember that we have only a few data points (four houses with prices), which is too few to be representative and too few for machine learning. Also, linear regression assumes a linear relation between input features and the target variable. As we collect more data points, linear regression might no longer be the best fit, and we will be motivated to explore other techniques (algorithms) depending on trends and patterns in the data.

Hyperparameter Tuning

As discussed earlier in this chapter, hyperparameters play a key role in model accuracy and training performance. Hence, tuning them is a critical step in getting to a good model. Because different algorithms have different sets of hyperparameters, this step of tuning hyperparameters adds to the complexity of the end-to-end process.

The End-to-End Process

With that basic understanding of feature engineering, algorithm selection, and hyperparameter tuning, let’s go step by step through our house-price prediction problem.

Let’s begin with the Size, Lot size, and Year built features and gradient boosted trees with specific hyperparameter values, as shown in Figure 1-10. The resulting model is 30% accurate. But we want to do better than that.

Figure 1-10. Machine learning process: step 1

To get underway, we try different values of hyperparameters for the same set of features and algorithm. If that doesn’t improve the accuracy of the model to a satisfactory level, we try different algorithms, and if that doesn’t help either, we add more features. Figure 1-11 shows one such intermediate state, with School added as a feature and the k-nearest neighbors (KNN) algorithm used. The resulting model is 50% accurate but still not good enough, so we continue this process and try different combinations.

Figure 1-11. Machine learning process: intermediate state

After multiple iterations of trying out different combinations of features, algorithms, and hyperparameter values, we end up with a model that meets our criteria, as shown in Figure 1-12.

Figure 1-12. Machine learning process: best model

As you can see, this is an iterative and time-consuming process. To put this in perspective: if there are 10 features, there are a total of $2^{10}$ (1,024) ways to select features. If we try five algorithms, and assuming each has an average of five hyperparameters, we are looking at a total of 1,024 × 5 × 5 = 25,600 iterations!
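The arithmetic behind that estimate can be checked in a few lines of Python. The variable names are illustrative, and the count follows the text's simplification of one setting per hyperparameter.

```python
from itertools import combinations

n_features = 10
n_algorithms = 5
n_hyperparameters = 5  # average per algorithm, as assumed in the text

# Each feature is either included or excluded, giving 2**n subsets
# (this count includes the empty subset).
feature_subsets = 2 ** n_features

# Cross-check by enumerating every subset size from 0 to n_features.
enumerated = sum(
    1 for size in range(n_features + 1)
    for _ in combinations(range(n_features), size)
)

total_iterations = feature_subsets * n_algorithms * n_hyperparameters
print(feature_subsets, enumerated, total_iterations)
```

If each hyperparameter is tried at several candidate values rather than one, the total grows multiplicatively again, so 25,600 is better read as a conservative illustration than an upper bound.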

Figure 1-13 shows the scikit-learn cheat sheet, demonstrating that choosing the proper algorithm could be a complex problem in itself. Now imagine adding feature engineering and hyperparameter tuning on top of it. As a result, it takes data scientists anywhere from a couple of weeks to months to arrive at a good model.

Figure 1-13. Scikit-learn algorithm cheat sheet (source: https://oreil.ly/xUZbU)

Growing Demand

Despite the complexity of the model-building process, demand for machine learning has skyrocketed. Most organizations across all industries are trying to use data and machine learning to gain a competitive advantage, infusing intelligence into their products and processes to delight customers and amplify business impact. Figure 1-14 shows the variety of real-world business problems being solved using machine learning.

Figure 1-14. Real-world business problems using machine learning

As a result, there is huge demand for machine learning–related jobs. Figure 1-15 shows the percentage growth in various job postings from 2015 to 2018.

Figure 1-15. Growth in machine learning–related jobs

Figure 1-16 shows the expected revenue from enterprise applications using machine learning and artificial intelligence growing astronomically.

Figure 1-16. Machine learning/artificial intelligence revenue projections

Conclusion

In this chapter, you learned some of the best practices that successful machine learning projects have in common. We discussed that the process of building a good machine learning model is iterative and time-consuming, resulting in data scientists requiring anywhere from a couple of weeks to months to build a good model. At the same time, demand for machine learning is growing rapidly and is expected to skyrocket.

To balance this supply-versus-demand problem, there needs to be a better way to shorten the time it takes to build machine learning models. Can some of the steps in that workflow be automated? Absolutely! Automated Machine Learning is one of the most important skills that successful data scientists need to have in their toolbox for improved productivity.

In the following chapters we’ll go deeper into Automated Machine Learning. We will explore what it is, how to get started, and how it is being used in real-world applications today.
