Big Data Analytics using Spark - GitHub Pages
Big Data Analytics using Spark
CSE255/DSE230
What is “Big Data”?
• 1 GB?
• 1 TB?
• 1 PB?
• …
• We need a definition that does not change over time.
• More data than can fit on a single workstation.
• Communication dominates computation.
“Data Science” vs. “Computer Science”
• Computer science focuses on the algorithm.
• Requirements specify the input-to-output relationship (find the shortest path).
• The algorithm should be correct and efficient.
• Input (data) can be anything that conforms to the input format.
• Data science focuses on the data.
• The goal is to understand/model/control the physical process generating the data.
• Algorithms are used by the data scientist to identify patterns in the data.
• Data is assumed to conform to a statistical model.
What is a data scientist? From: Doing Data Science: Straight Talk from the Frontline, Rachel Schutt & Cathy O’Neil
& Communication skills
There are many good jobs in data science
• Data Scientist: one of the ten top jobs in 2016 according to Forbes and Glassdoor.
• There are currently 8446 data science openings in the US (LinkedIn).
• 7000 openings in India (naukri.com).
• Median base salary is around $116,000 per year (Glassdoor).
Halicioglu graduated with a bachelor’s degree in computer science in 1996.
Nick Woodman, Founder of GoPro. Woodman graduated from UCSD in June 1997 with a B.A. in visual arts and a minor in creative writing.
The output of a single GoPro
• GoPro Hero 5 Black: $400.
• 120 FPS at 1080p (1920 x 1080)
• = 250 Mpixel/sec; each pixel is 3*8 bits, so 6 Gbit/sec uncompressed.
• Max compressed output bitrate: 60 Mbit/sec.
• Compression by a factor of 100.
• 2:14 minutes = 1 GB compressed.
• Image processing requires uncompressed data.
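The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the variable names are illustrative, and it assumes decimal units (1 GB = 8 * 10^9 bits).

```python
# Back-of-the-envelope check of the GoPro data rates quoted on the slide.
# Assumption: decimal units (1 GB = 8e9 bits); all names are illustrative.

WIDTH, HEIGHT = 1920, 1080      # 1080p frame size in pixels
FPS = 120                       # frames per second
BITS_PER_PIXEL = 3 * 8          # three color channels, 8 bits each
COMPRESSED_BPS = 60e6           # max compressed output, bits per second

pixels_per_sec = WIDTH * HEIGHT * FPS
raw_bps = pixels_per_sec * BITS_PER_PIXEL          # uncompressed bit rate
compression_factor = raw_bps / COMPRESSED_BPS
seconds_per_gb = 8e9 / COMPRESSED_BPS              # time to record 1 GB compressed
raw_gb_per_minute = raw_bps * 60 / 8e9             # uncompressed volume per minute

print(f"pixel rate:         {pixels_per_sec / 1e6:.0f} Mpixel/sec")
print(f"raw bit rate:       {raw_bps / 1e9:.2f} Gbit/sec")
print(f"compression factor: {compression_factor:.1f}")
print(f"1 GB compressed:    {seconds_per_gb:.0f} seconds of video")
print(f"raw data volume:    {raw_gb_per_minute:.1f} GB per minute")
```

Under these assumptions the exact figures come out slightly below the rounded numbers on the slide: about 249 Mpixel/sec, just under 6 Gbit/sec, a compression factor near 100, and roughly 45 GB of uncompressed data per minute.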
Processing at the source
• Suppose you wanted to use a GoPro to monitor your front door.
• The GoPro uses sophisticated lossy compression to reduce data by a factor of 100.
• However, to perform analysis, your PC would have to uncompress the data and then process >40 GB per minute.
• You would need a beefy computer.
• But most of the time there is very little change from frame to frame, so if a change detector is implemented on the camera, there is, most of the time, nothing to communicate.
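The idea of a change detector at the source can be sketched with simple frame differencing. This is an illustrative toy, not the GoPro's actual method: the function names `mean_abs_diff` and `frames_to_send` and the threshold of 5.0 are assumptions made for the example, and frames are modeled as small grayscale grids.

```python
# Minimal frame-differencing change detector (illustrative sketch only).
# Frames are grayscale images stored as lists of rows of 0-255 integers.

def mean_abs_diff(prev_frame, frame):
    """Average absolute per-pixel difference between two equal-sized frames."""
    total = 0
    count = 0
    for prev_row, row in zip(prev_frame, frame):
        for p, q in zip(prev_row, row):
            total += abs(p - q)
            count += 1
    return total / count

def frames_to_send(frames, threshold=5.0):
    """Return indices of frames that differ enough from the last sent frame."""
    sent = []
    reference = None  # last frame that was communicated
    for i, frame in enumerate(frames):
        if reference is None or mean_abs_diff(reference, frame) > threshold:
            sent.append(i)
            reference = frame
    return sent

# Toy example: a static 4x4 scene, then a bright object appears in frame 3.
static = [[10] * 4 for _ in range(4)]
changed = [[10, 10, 200, 200] for _ in range(4)]
frames = [static, static, static, changed, changed]

print(frames_to_send(frames))  # prints [0, 3]
```

Only two of the five frames are communicated here: the first one and the one where the scene changes. A real detector would also need to handle sensor noise and lighting drift, which this sketch ignores.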
Scaling up: Sensor networks & Smart cities
MatchPoint: https://datascience.sdsc.edu/matchpoint
CSE255/DSE230
• A fun course.
• Not an easy course.
• Weekly HW, from Friday to Friday; expect to spend ~10 hours on each HW.
• You are expected to figure out things on your own.
• Consult documentation of Python, Spark, etc.
• Brush up on your linear algebra: eigenvectors, eigenvalues, eigendecomposition.
• See linear algebra material on the website.
• Wikipedia
• You are expected to participate in class and on Piazza.
What will you learn? From: Doing Data Science: Straight Talk from the Frontline, Rachel Schutt & Cathy O’Neil
& Communication skills
Python, Spark
Linear Algebra, PCA, Regression, Classification
Jupyter Notebooks, Visualization, Interpretation, Breakdown Problems
Jupyter Notebooks
• Pull them from the GitHub repository.
• They are your main resource:
• Class slides are derived from the notebooks
• Code
• Explanations
• Pointers to additional resources
• Exercises
Grading
• HW: 50%
• There will be 9 HW assignments; the one with the lowest grade will be dropped from the average.
• Quiz: 10%
• Each Thursday. Lowest grade dropped from the average.
• Breakdown Problems: 10%
• Explained on the class web page.
• Final: 30%
• Yet to decide whether in-class or take-home.
More details on the website
• Go to https://mas-dse.github.io/DSE230/