A Mathematical Perspective on Data Science - large (86.7 MB)
Transcript of A Mathematical Perspective on Data Science - large (86.7 MB)
![Page 1: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/1.jpg)
Copyright©2016SplunkInc.
AMathematicalPerspectiveonDataScience
Dr.TomLaGattaStaffSalesEngineer(previouslyStaffDataScientist)
![Page 2: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/2.jpg)
AboutMe• MathPhDfromUniversityofArizona
– ”Geodesics ofRandomRiemannianMetrics”w/Janek Wehr– Probability+DifferentialGeometry+FunctionalAnalysis
• PostdocatCourantInstitute@NYU– FinishedGeodesicspaper,published inCommunications inMath.Physics– CollaboratedwithPoliticalScientists onheterogeneousvotingbehavior
• Was:StaffDataScientistatSplunk– HelpedcustomerswithadvancedusecasesinBusinessAnalytics, Internetof
Things,MachineLearning,DataScience
• Now:StaffSalesEngineeratSplunk– Helpingbigbigcustomerssolvebigbigbusinessproblems
![Page 3: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/3.jpg)
Abstract● Aswithallthings,theprocessofanalyzingdataadmitsamathematicaldescription.Asa
mathematician-turned-data-scientist,Iwilldescribemyapproachtoproblemsolving,andattempttolooselyformalize"stakeholders","usecases","data"and"deliverables"inmathematicallanguagefortheenjoymentofthismostlyacademicaudience.Inparticular,Iwilldescribehowquerylanguagesareinherentlyfunctional,actingasfunctionaltransformationsofDataintoData,whichobeytheusualfunctionalcomposition law.Theprocessofanalyzingdataresultsinaniterativesequenceofqueries,convergingtoafinalquerywhichissatisfactorytotheusecase.Thesequeriesarethenorganizedintodeliverables,whichcanbe"dashboards" (webpageswithvisualizations)or"dataproducts"(withscheduledjobs&analysesrunninginthebackground).Whenthisprocessisdoneright,itresultsintheextractionof"value"forstakeholders,whichcanbemeasuredtangiblyintermsofrevenue,costsorriskmetrics.Sometimesthishasafancynamelike"datascience",butmoreoftenthannot, isjustthenormaloperationalworkofagooddata-savvyIT,Security,TechorBusinessdepartmentinmodernenterprisesandgovernmentalagencies.Therewillbenoproofs,but Iwouldbeveryinterestedtodiscussrigorousapproachestosocialorganization&problemsolvingafterthetalk.
![Page 4: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/4.jpg)
Agenda• BasicDefinitions:
– Data,Stakeholders,UseCases,Deliverables
• Querylanguagesasfunctionalprogramming– Everyqueryisamapf:Data->Data– Exampleproblemsolving
• PuttingItAllTogether:DoingDataScience– Emphasizeactionable insights– Tieittogethertodeliver”value”tostakeholders
![Page 5: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/5.jpg)
Copyright©2016SplunkInc.
BasicDefinitions
![Page 6: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/6.jpg)
WhatisData?• ”Data”isanyinformationalartifactofreal-worldphenomena• A”metric”or”KPI”isanyaggregatefunctionoflow-leveldata• Examples:
– Semi-structuredtimestamped events/metrics– Structuredrelationaldata(rows&columns)– Graphdata(nodes&edges)– ”Unstructured”data(images,video,text)
• Howtomodeldata:– Events:markedpointprocesses×eries(Skorokhod space)– Relationalschema:categories(seeDavidSpivak’swork)– Otherdata:dependsontheusecase,mightneednewdatastructuresto
representit(incl.vectors,graphs,etc.)
![Page 7: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/7.jpg)
WhatisaStakeholder?• A”stakeholder”isaperson,groupororganizationwhoisinvestedintheoutcomeofaninitiative.
• Example:Acompanybuyssoftware– StakeholderorgsareIT&theBusiness– Individualstakeholders includeIndividual
Contributors,Managers,Directors&Executives.
• ITstakeholdershaveperformancemetrics(num.outages,meantimetoresolution,etc)
• Businessstakeholdershavedifferentmetrics(revenue,cost,risk)• Customerstakeholdersdownstreamalsohavevalue/impactmetrics
![Page 8: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/8.jpg)
WhatisaStakeholder?(cont.)• Howtomodelstakeholders?
– FollowGameTheoryforinspiration(butdon’tworryabout”equilibrium”)– Createanindexset Iwithallstakeholders.Variousactions,outcomes&
objectiveswillhavesubscripts i basedonstakeholders– E.g.,personi choosesactionai,t attimet– Icanbehierarchical(Personi containedinorgA,soparent(i)=A)– ”Value”modeledbyobjectivefunctions(Ui,n =objective#nforpersoni)– Istheaction”pivotal”fortheoutcome?(ie 𝔼[U|do(a)]>𝔼[U|do(nota)]?)
• Keeptrackofstakeholdersdata:– Mightbehigh-level(email,Powerpoints)– contextiskey– Mightbeindatabases(transactions,customerrecords,ticketsdata)– Mightbegranulareventsdata(webclickstream,logs,mobile,wiredata)
![Page 9: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/9.jpg)
WhatisaUseCase?• A”usecase”consistsofabusinessproblem,astrategytoalleviatetheproblem,metricstoevaluatetheoutcome,datatopowerasolution,andstakeholderswhoareinvolvedinitsdevelopment.
• UseCase:ProblemForecasting.– Companyhascostlynetwork/systemoutages– IThiresaDataScientisttohelpbuildsolution.– Dataincludes Infrastructure(CPU,Memory),
Operations(OutageReports),Applogs,etc– Metric:costofoutages≈
numoutages*timetoresolution*costoflabor– StakeholdersincludeIT&Business, andimpactedcustomers
![Page 10: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/10.jpg)
WhatisaDeliverable?• A”deliverable”isathingproducedtosolveausecase
– Caninclude”dashboards”:informationalwebpagesbuiltfromdataqueries– Or”workflows”:notableeventsdeliberatedtooperationsanalysts– Orfull-fledged”dataproduct”:applicationwhich*does*stuffautomatically
• Deliverable:ProblemForecastingSystem– Goal:forecastproblemsbeforetheystart,
deliver”proximaterootcause”toITtoinvestigate– Data:CPU,Memory,Latency,ServiceTickets– Buildmachine learningmodeltocorrelate
InfrastructuredatawithServiceimpact– Applymodeltoincomingevents,create
”predicted(Risk_Score)”– Surfacehigh-riskeventstoITOperations
![Page 11: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/11.jpg)
Copyright©2016SplunkInc.
QueriesasFunctionalProgramming
![Page 12: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/12.jpg)
QueryLanguages• Querylanguagesprovideaformulaicapproachtoworkingwithdata• A”query”isastringwhichtellswheretogetthedata,whattodowiththedata,andwheretoputthedata(incl visualizationorDB)
• Mathematically,everyqueryisaFUNCTIONf:Data->Data• Queriescanbecomposed(with|symbol),andanalysisisiterative• Example1:SuccessfulPurchaseActionsfromWebLogs
sourcetype=access_combined action=purchase status=200
• Example2:PlotPurchaseValueasMetricTimeseriessourcetype=access_combined action=purchase status=200| timechart partial=f span=5m sum(price) as value
![Page 13: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/13.jpg)
1:SuccessfulPurchaseActionsfromWebLogs
13
![Page 14: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/14.jpg)
2:PlotPurchaseValueasMetricTimeseries
14
Wait!ProblemIdentified!Whyarepurchasesdecreasing?
![Page 15: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/15.jpg)
3:InvestigateDatabaseErrors
15
![Page 16: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/16.jpg)
4:PlotDatabaseErrors
16
![Page 17: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/17.jpg)
5:CorrelateDBproblemswithpurchasevalue
17
![Page 18: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/18.jpg)
6:SaveasDeliverable,SendtoStakeholders
18
![Page 19: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/19.jpg)
7:Movetowardproactivemonitoringstance
19
![Page 20: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/20.jpg)
Copyright©2016SplunkInc.
PuttingItAllTogether:DoingDataScience
![Page 21: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/21.jpg)
EmphasizeActionable Insights• Avoideye-candyvisualizations
– “Laserbeam”threatdashboardslookcoolbutareuseless
• “Howdoesthishelpmesolvemyproblem?”• Guidetheviewertodrilldown&actquickly
Confusingviz:notactionable Goodviz:actionable
![Page 22: A Mathematical Perspective on Data Science - large (86.7 MB)](https://reader034.fdocuments.net/reader034/viewer/2022042617/626445081f7b530b5d589459/html5/thumbnails/22.jpg)
DoingDataScience• DataScientistsresisteasycharacterization.Abitof:
– Statistician– SoftwareEngineer– Business Analyst– Spockonthebridge
• Manyscalesofaction:– Getyourhandsdirtywhenneeded– Butstepbackandseethebigpicture– Emphasize”actionable insights”throughoutthewholeorg– Alsohavepoliticalcredibilitytosay,”Thisisabaddecision,don’tdoit”
• DataScientistsguideorganizationsthroughbuildingdataproductstosolvebigproblemsanddelivervalue