Using Map-Reduce to Teach Parallel Programming...

Post on 16-Aug-2020

4 views 0 download

Transcript of Using Map-Reduce to Teach Parallel Programming...

csinparallel.org

Using Map-Reduce to Teach Parallel Programming

Concepts

Dick Brown, St. Olaf CollegeLibby Shoop, Macalester College

Joel Adams, Calvin College

csinparallel.org

Workshopsite

CSinParallel.org->Workshops->WMRWorkshopSeealsoworkshophandout

csinparallel.org

Introductorycomments

– Roleofundergraduateresearchers:Therewouldbenoworkshopwithoutthem!

– ThankstoAmazonWebServicesforprovidingcreditstohostourWMRinstance

– Disclaimer:Wearenotproposingmap-reduceastheonlyapproachtointroducingparallelism,concurrency

– Avalue:Honorthyneighbor'scurricularapproach

csinparallel.org

Goals

–  Introducemap-reducecompuLng,usingtheWebMapReduce(WMR)simplifiedinterfacetoHadoop

• Whyusemap-reduceinthecurriculum?

– Hands-onexerciseswithWMRforfoundaLoncourses

csinparallel.org

Goals

Part1–IntroducLon– Map-reducecompuLng,andtheWebMapReduce(WMR)simplifiedinterfacetoHadoop

– Hands-onexerciseswithWMRforfoundaLoncourses

Part2–TeachingwithWMR– Whyusemap-reduceinthecurriculum?– UseofWMRforintermediateandadvancedcourses– Hands-onexercisesformoreadvanceduse

Part3(opLonal)–What’sunderthehood?

csinparallel.org

SneakPreview:Materialsavailable

(Incaseyoualreadyknowyourmap-reduce…)

•  CSinParallelmodule:Map-reduceCompu;ngforIntroductoryStudentsusingWebMapReduce– Seecsinparallel.org

csinparallel.org

Part1:IntroducLontoMap-ReduceCompuLngandWMR

csinparallel.org

IntroducLontoMap-ReduceCompuLng

csinparallel.org

History

– ThecomputaLonalmodelofusingmapandreduceoperaLonswasdevelopeddecadesago,forLISP

– GoogledevelopedMapReducesystemforsearchengine,published(DeanandGhemawat,2004)

– Yahoo!createdHadoop,anopen-sourceimplementaLon(underApache);Javamappersandreducers

csinparallel.org

Map-Reduce:The2-minuteoverview

Whatifyouwantedtocountthefrequenciesofallwords

in1,000,000books?

csinparallel.org

Map-Reduce:The2-minuteoverview

Whatifyouwantedtocountthefrequenciesofallwords

in1,000,000books?

1.  Breakupthelinesoftext:generateonelabelledpieceperword

•  Usethatwordaslabel;value1foreachpiece

2.  Groupthepiecesaccordingtolabel(word)3.  Addupthe1’sineachgroup

csinparallel.org

Map-ReduceConcept

csinparallel.org

Themap-reducecomputaLonalmodel•  Map-reduceisatwo-stageprocesswitha"shuffletwist"

betweenthestages.

•  StagesarecontrolledbyfuncLons:mapper(),reducer()

csinparallel.org

Themap-reducecomputaLonalmodel

•  mapper()funcLon:– Argumentisonelineofinputfromafile–  Produces(key,value)pairs

•  Example:word-countmapper()"thecatinthehat”-->[mapperforthisline]("the","1"),("cat","1"),("in","1"),("the","1"),("hat","1")

csinparallel.org

Themap-reducecomputaLonalmodel

•  Shufflestage:– groupallmappers’(key,value)pairstogetherthathavethesamekey,andfeedeachgrouptoitsowncallofreduce()

–  Input:all(key,value)pairsfromallmappers– Output:Thosepairsrearranged,senttocallsofreduce()accordingtokey

•  Note:Shufflealsosorts(opLmizaLon)

csinparallel.org

Themap-reducecomputaLonalmodel

•  reducer()funcLon:– Receivesallkey-valuepairsforonekey– Producesanaggregateresult

•  Example:word-countreducer()("the","1"),("the","1")-->[reducerfor"the"]("the","2")

csinparallel.org

Themap-reducecomputaLonalmodel

–  Inmap-reduce,aprogrammercodesonlytwofuncLons(plusconfiginformaLon)

•  Amodelforfutureparallel-programmingframeworks

– Underlyingmap-reducesystemreusescodefor•  ParLLoningthedataintochunksandlines,•  Runsmappers/reducerswherethechunksarelocal•  Movingdatabetweenmappersandreducers•  Auto-recoveringfromanycrashesthatmayoccur•  ...

– OpLmized,Distributed,Fault-tolerant,Scalable

csinparallel.org

Themap-reducecomputaLonalmodel

•  OpLmized,Distributed,Fault-tolerant,Scalable

1.mappers 2.shuffle 3.reducersLocalI/O

GlobalI/O

csinparallel.org

DemoofWMR

cumulus.cs.stolaf.edu/wmr

Intromodule

csinparallel.org

Materialsavailable

•  CSinParallelmodule:Map-reduceCompu;ngforIntroductoryStudentsusingWebMapReduce– Seecsinparallel.org

csinparallel.org

Overviewofsuggestedexercises

Availableonthecsinparallel.orgsite

– Runwordcount(provided),withsmallandlargedata

– Modify,runvariaLonsonwordcount:strippunctuaLon;caseinsensiLve;etc.

– AlternaLveexercises

csinparallel.org

AddiLonalexercises

Beyondyourfirstsimpleexercises,considerexploringthefollowing:•  Variousdatasets

– Note:PleaseavoidlargeGutenberg"groups"forthisworkshop

•  ExtendedsetofexercisesforCS1(textanalysis)

csinparallel.org

Hands-onexploraLonofWMR

csinparallel.org

Part2:TeachingwithWMR

Whymap-reduce?WhyWMR?

TeachingWMRinCS1;inothercourses

csinparallel.org

Whyteachmap-reduce?

csinparallel.org

Whymap-reduceforteachingnoLonsofparallelism/concurrency?

– Concepts:•  dataparallelism;•  taskparallelism;•  locality;•  effectsofscale;•  exampleeffecLveparallelprogrammingmodel;•  distributeddatawithredundancyforfaulttolerance;...

csinparallel.org

Whymap-reduceforteachingnoLonsofparallelism/concurrency?

– Real-World:Hadoopwidelyused– ExciLng:theappealofGoogle,Facebook,etc.– Useful:forappropriateapplicaLons– Powerful:scalabilitytolargeclusters,largedata

csinparallel.org

WhyWMR?

–  Introduceconceptsofparallelism– Lowbarforentry,feasibleforCS1(andbeyond)– CapturetheimaginaLonsofstudents

•  SupportsrapidintroducLonofconceptsofparallelismforeveryCSstudent–  Intromoduledesignedfor1-3daysofclass

csinparallel.org

WebMapReduce(WMR)

– SimplifiedwebinterfaceforHadoopcomputaLons

– Goals:•  StrategicallySimplesuitableforCS1,butnotatoy

•  Configurablewritemappers/reducersinanylanguage

•  AccessiblewebapplicaLon•  MulL-plamorm,front-endandback-end

csinparallel.org

WMRFeatures(Briefly)

– TesLnginterface•  Errorfeedback•  BypassesHadoop--smalldataonly!

–  StudentsenterthefollowinginformaLon:•  choiceoflanguage•  datatoprocess•  definiLonofmapperinthatlanguage•  definiLonofreducerinthatlanguage

csinparallel.org

WMRsysteminformaLon

– Languagescurrentlysupported:Java,C++,Python,Scheme,C,C#,Javascript

•  Rcomingsoon

– Backendstodate:localcluster,AmazonEC2cloudimages

•  Versionlimitsandmorebackendscomingsoon

Moredetailsaboutthesystemin(opLonal)Part3oftheworkshop

csinparallel.org

Teachingmap-reducewithWMRintheintroductorysequence

csinparallel.org

KinestheLcstudentacLvity•  VisualizaLonsofmap-reducecomputaLonsareenoughforsomestudents,butnotall

•  Anin-classac/vitytoactoutthemap/shuffle/reduceprocesshelpsothers

•  Alsohelpful:imagesofclusters;sequenLalversions;contextofwell-knownwebservices

csinparallel.org

WMRinadvancedcourses

Example:PDCelecLve•  CS1module•  Map-reduceprogrammingtechniques

– FeaturesofWMR– Contextforwarding– Structuredvalues;structuredkeys– Mul;-casemappers;mul;-casereducers– Broadcas;ngdatavalues

csinparallel.org

ExamplesforTextProcessingTechniques

•  Combiningdatawithinamapper–  Mapper:Tallycountsofwordsbeforesendingtoreducer

•  ComputaLonallinguisLcs:–  wordsthatareco-located

•  FindandcountpairsExampleIn:thecatinthecathatEmits:1cat|in1in|the1cat|hat2the|cat

•  Usecombiningproceduretofind‘stripes’ExampleIn:thecatandthedogfoughtoverthedogboneEmits:(the,{cat:1,dog:2}

Thanksto:DataIntensiveTextProcessing,byJimmyLinandChrisDyer

csinparallel.org

ApplicaLonideas

•  Examplesintheintroductorymodule•  Bigdatasetspeoplecareabout

•  Especiallyforunstructureddata•  Convenientforcertainkindsofprojects

– E.g.,mostcommonmedicalterminology

csinparallel.org

WMRHands-on,conLnuedModuleexercisesExtendedexercisesetDatasetsavailable/shared/MovieLens2/movieRaLngs/shared/gutenberg/WarAndPeace.txt/shared/gutenberg/CompleteShakespeare.txt

csinparallel.org

Part3:What’sunderthehood

csinparallel.org

AboutWMR

•  WMRanditsarchitecture

•  ObtainingandinstallingWMR– WebMapReduce.sf.com

ClusterHeadNode

WebServer

UserBrowser

UserBrowser

Cluster

csinparallel.orgBasic

Hadoopcomponents•  Internals:

–  Jobmanagement(percluster)– Taskmanagement(percomputaLonnode)

•  Somecomponentsvisibletotheuser:– HadoopAPI–Java,orarbitraryexecutables(“Streaming”)

– HadoopDistributedFileSystem(HDFS)– Supporttools,includinghadoop command– Limitedjobmonitoring…

csinparallel.org

Goals

–  Introducemap-reducecompuLng,usingtheWebMapReduce(WMR)simplifiedinterfacetoHadoop

• Whyusemap-reduceinthecurriculum?– Hands-onexerciseswithWMRforfoundaLoncourses

– UseofWMRforintermediateandadvancedcourses

• What’sunderthehoodwithWMR•  ApeekatHadoop…

– Hands-onexercisesformoreadvanceduse

csinparallel.org

WMRinadvancedcourses

csinparallel.org

InverLng"Chapter1:CallmeIshmael.Some…”"Chapter2:Istuffedashirtortwo...""Chapter3:Enteringthatgable-ended...”

-->[mapper]("call","1"),("me","1"),...,("i","2"),("stuffed”,"2"),...,("entering","3"),...

-->[reducer]

"a""1,1,1,1,...,2,2,2,...""aback""3,7,7,8,...”...

csinparallel.org

Whenismap-reduceappropriate?

•  Massive,unstructuredorirregularlystructured“bigdata”(Terascaleandupward)– Rawtext– Webpages– XML– Unstructuredstreamsofdata

•  Otherapproachesmayfitstructured“bigdata”– Scalabledatabases– Large-scalestaLsLcalapproaches

csinparallel.org

UsingHadoopdirectly(Java)

WordCount.javaexample

csinparallel.org

DirectHadoopExamples

•  Wordcount–  Java

csinparallel.org

QuickquesLons/commentssofar?

csinparallel.org

Hands-on

csinparallel.org

Overviewofsuggestedexercises

– ComputaLonswithMovieLens2data;mulLplemap-reducecycles

– Trafficdataanalysis– NetworkanalysisusingFlixterdata– TheMillionSongdataset

csinparallel.org

Discussion

csinparallel.org

EvaluaLons!

Linksat:CSinParallel.org->Workshops->WMRWorkshop(endofthepage)

csinparallel.org

SomeconsideraLonswithHadoop

– Numbersofmappersandreducers– DFS– Faulttolerance–  I/Oformats

– Note:wehavefurtherslideswithaddiLonalinformaLonabouttheseaspects,foryoutolookatonyourown.

csinparallel.org

DirectHadoopexercisesetup

– Edityourownfiles,locally– scptocluster'sadminnode(oncloud)– sshtocompile,launchjob– Percentageprogressoutputisprovided– MulLplecyclesupportviaDFS–  (Cleanup)

csinparallel.org

AddiLonalDetailsaboutHadoop

csinparallel.org

ThehadoopprojectdocumentaLon

•  h}p://hadoop.apache.org/common/docs/current/index.html

csinparallel.org

Howmanymappers?

•  TheHadoopMap/ReduceframeworkspawnsonemappertaskforeachInputSplitgeneratedbytheInputFormatforthejob.

•  Thenumberofmappersisusuallydrivenbythetotalsizeoftheinputs,thatis,thetotalnumberofblocksoftheinputfiles.

csinparallel.org

Howmanyreducers?

•  ThenumberofreducersforthejobissetbytheuserviaJobConf.setNumReduceTasks(int)

•  Thesizeofyoureventualoutputmaydictatehowmanyreducersyouchoose.

csinparallel.org

HDFS

•  Fault-tolerantdistributedfilesystemmodeleda~ertheGoogleFileSystem– we'vehadstudentsreadtheoriginalGFSpaperinanadvancedcourse

•  h}p://hadoop.apache.org/hdfs/docs/current/index.html

•  NotethesecLonaboutthefilesystemcommandsyoucanrunfromthecommandline:hadoopfs-lsHadoopfs-getor-put

csinparallel.org

HDFSAssumpLonsandGoals

•  Hardwarefailure–  HardwarefailureisthenormratherthantheexcepLon.

•  Streamingdataaccess–  ApplicaLonsthatrunonHDFSneedstreamingaccesstotheirdatasets.TheyarenotgeneralpurposeapplicaLonsthattypicallyrunongeneralpurposefilesystems.

•  LargeDataSets•  Simplecoherencymodel

–  Readmany,writeonce•  MovingcomputaLonsissimplerthanmovingdata•  Portabilityacrossvarioushardware/so~ware

csinparallel.org

csinparallel.org

csinparallel.org

Input/Outputformats

–  InputintomappersareinterpretedusingclassesimplemenLngtheinterfaceInputFormat,andoutputfromreducersareimplementedusingclassesimplemenLngtheinterfaceOutputFormat.

–  InWMR,themapperinputandreduceroutputisperformedwithkey-valuepairs.ThiscorrespondstousingtheclassesKeyValueTextInputFormatandTextOutputFormat.

–  IndirectHadoop,thedefaultinputformatisTextInputFormat,inwhichvaluesarelinesofthefileandkeysareposiLonswithinthatfile.

csinparallel.org

SomefurtherfeaturesofHadoop

– Combiner,anopLmizaLon:performsome"reducLon"duringthemapphase,a~ermapper()andbeforeshuffle

– SorLngcontrol•  Note:hardtosortonsecondarykey

– Threeprogramminginterfaces:Java;pipes(C++);streaming(executables)

csinparallel.org

Pagerankalgorithmideas

•  Originaldata:onewebpageperline

–  mapperproduces("dest","1/kPn")foreachlinkinpagePnwhereklinksappearwithinthatpagePnreducerproduces("dest","weight_0P1P2P2P3P4...")whereweightissumoftheweightsfromkeyvaluepairsemi}edbyP1,P2,...

–  SubsequentmappersandreducersproducerefinedweightsthattakeintoaccountdeeperchainsofpagespoinLngtopages

–  Finalreducerdelivers("dest","weight_k")[dropPns]