Lecture 08: Principles of Parallel Algorithm Design
Transcript of Lecture 08: Principles of Parallel Algorithm Design
Concurrent and Multicore Programming, CSE 436/536
Department of Computer Science and Engineering
Yonghong Yan
[email protected]/~yan

Last lecture: Algorithms and Concurrency
• Introduction to Parallel Algorithms
– Tasks and Decomposition
– Processes and Mapping
• Decomposition Techniques
– Recursive Decomposition (divide and conquer)
– Data Decomposition (input, output, input + output, intermediate)
• Terms and concepts
– task dependency graph, task granularity, degree of concurrency
– task interaction graph, critical path
• Examples:
– dense vector addition, matrix-vector product
– dense matrix-matrix product
– database query
– quicksort, MIN

Today's lecture
• Decomposition Techniques (continued)
– Exploratory Decomposition
– Hybrid Decomposition
– Mapping tasks to processes/cores/CPUs/PEs
• Characteristics of Tasks and Interactions
– Task Generation, Granularity, and Context
– Characteristics of Task Interactions
• Mapping Techniques for Load Balancing
– Static and Dynamic Mapping
• Methods for Minimizing Interaction Overheads
• Parallel Algorithm Design Models

Exploratory Decomposition
• The decompositions so far are fixed/static by design
– data and recursive decomposition
• Exploration (search) of a state space of solutions
– the problem decomposition reflects the shape of the execution
– it goes hand-in-hand with the execution
• Examples
– discrete optimization, e.g., 0/1 integer programming
– theorem proving
– game playing

Exploratory Decomposition: Example
Solve a 15-puzzle
• A sequence of three moves leads from state (a) to the final state (d)
• From an arbitrary state, one must search for a solution

Exploratory Decomposition: Example
Solving a 15-puzzle
• Search
– generate the successor states of the current state
– explore each as an independent task (see the sketch below)
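
A minimal sketch of this pattern, assuming OpenMP tasks: each successor state is explored as an independent task, and a shared flag terminates the search once any task finds a solution. The state type, goal test, and move generator below are simplified stand-ins, not a real 15-puzzle.

```c
#include <stdio.h>
#include <stdbool.h>
#include <omp.h>

static volatile bool found = false;   /* only ever goes false -> true, so the
                                         unsynchronized reads are benign here */

/* Stand-ins for the real puzzle: a goal test and a move generator. */
static bool is_goal(long s)          { return s == 42; }
static long successor(long s, int m) { return 3 * s + m; }

static void explore(long s, int depth) {
    if (found || depth > 8) return;    /* prune once a solution exists */
    if (is_goal(s)) { found = true; return; }
    for (int m = 0; m < 4; m++) {
        /* each successor state becomes an independent task */
        #pragma omp task
        explore(successor(s, m), depth + 1);
    }
    #pragma omp taskwait
}

int main(void) {
    #pragma omp parallel
    #pragma omp single                 /* one task generator; the team executes */
    explore(1, 0);
    printf("solution %sfound\n", found ? "" : "not ");
    return 0;
}
```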

Exploratory Decomposition Speedup
Solve a 15-puzzle
• The decomposition behaves according to the parallel formulation
– it may change the amount of work done relative to the serial search
• Execution terminates when a solution is found

Speculative Decomposition
• Dependencies between tasks are not known a priori
– This makes it impossible to identify independent tasks up front
• Two approaches
– Conservative approaches, which identify independent tasks only when they are guaranteed to have no dependencies
• may yield little concurrency
– Optimistic approaches, which schedule tasks even when they may potentially be inter-dependent
• roll back the changes in case of an error

Speculative Decomposition: Example
Discrete event simulation
• Centralized time-ordered event list
– you get up → get ready → drive to work → work → eat lunch → work some more → drive back → eat dinner → sleep
• Simulation
– extract the next event in time order
– process the event
– if required, insert new events into the event list
• Optimistic event scheduling
– assume the outcomes of all prior events
– speculatively process the next event
– if the assumption is incorrect, roll back its effects and continue

Speculative Decomposition: Example
Simulation of a network of nodes
• Simulate the network behavior for various inputs and node delays
– The inputs change dynamically, so the task dependencies are unknown
• Speculate on the tasks' inputs and execute
– correct: parallelism
– incorrect: roll back and redo

Speculative vs. Exploratory
• Exploratory decomposition
– The output of the multiple tasks from a branch is unknown
– The parallel program may perform more, less, or the same amount of work as the serial program
• Speculative decomposition
– The input at a branch leading to multiple parallel tasks is unknown
– The parallel program performs more or the same amount of work as the serial algorithm

Hybrid Decompositions
Use multiple decomposition techniques together
• A single decomposition may not be optimal for concurrency
– Quicksort's recursive decomposition limits concurrency (Why?)
• Combined recursive and data decomposition for MIN

Today's lecture
• Decomposition Techniques (continued)
– Exploratory Decomposition
– Hybrid Decomposition
– Mapping tasks to processes/cores/CPUs/PEs
• Characteristics of Tasks and Interactions
– Task Generation, Granularity, and Context
– Characteristics of Task Interactions
• Mapping Techniques for Load Balancing
– Static and Dynamic Mapping
• Methods for Minimizing Interaction Overheads
• Parallel Algorithm Design Models

Characteristics of Tasks
• In theory
– Decomposition: how to parallelize, theoretically
• the concurrency available in a problem
• In practice
– Task creation, interactions, and mapping to PEs
• realizing the concurrency practically
– The characteristics of tasks and task interactions
• these impact the choice and performance of parallelism
• Characteristics of tasks
– Task generation strategies
– Task sizes (the amount of work, e.g., FLOPs)
– Size of data associated with tasks

Task Generation
• Static task generation
– Concurrent tasks and the task graph are known a priori (before execution)
– Typically arise from recursive or data decomposition
– Examples
• matrix operations
• graph algorithms
• image processing applications
• other regularly structured problems
• Dynamic task generation
– Computations formulate the concurrent tasks and the task graph on the fly
• not explicit a priori, though high-level rules or guidelines are known
– Typically arise from exploratory or speculative decompositions
• also possible with recursive decomposition, e.g., quicksort
– A classic example: game playing
• the 15-puzzle board

Task Sizes / Granularity
• The amount of work → the amount of time to complete
– e.g., FLOPs, number of memory accesses
• Uniform
– often results from an even data decomposition, i.e., regular
• Non-uniform
– quicksort: depends on the choice of pivot

Size of Data Associated with Tasks
• May be small or large compared to the task sizes
– How it relates to the input and/or output data sizes
– Examples:
• size(input) < size(computation), e.g., the 15-puzzle
• size(input) = size(computation) > size(output), e.g., min
• size(input) = size(output) < size(computation), e.g., sort
• Consider the effort needed to reconstruct the same task context
– small data: small effort: the task can easily migrate to another process
– large data: large effort: ties the task to a process
• Reconstructing the context vs. communicating it
– It depends

Characteristics of Task Interactions
• Aspects of interactions
– What: shared data or synchronizations, and the sizes of the media
– When: the timing
– Who: with which task(s), and the overall topology/patterns
– Do we know the details of the above three before execution?
– How: does the interaction involve one task or both?
• The implementation concern: implicit or explicit
Orthogonal classification
• Static vs. dynamic
• Regular vs. irregular
• Read-only vs. read-write
• One-sided vs. two-sided

Characteristics of Task Interactions
• Aspects of interactions: what, when, who, and how (as above)
• Static interactions
– Partners and timing (and the rest) are known a priori
– Relatively simpler to code into programs
• Dynamic interactions
– The timing or the set of interacting tasks cannot be determined a priori
– Harder to code, especially using explicit interaction

Characteristics of Task Interactions
• Regular interactions
– There is a definite pattern to the interactions
• e.g., a mesh or ring
– The pattern can be exploited for an efficient implementation
• Irregular interactions
– lack well-defined topologies
– modeled as a graph

Example of Regular Static Interaction
Image processing algorithms: dithering, edge detection
• Nearest-neighbor interactions on a 2-D mesh (see the sketch below)
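
A minimal sketch of this nearest-neighbor pattern, assuming a simple 4-point smoothing kernel (illustrative, not a specific dithering or edge-detection algorithm): each output point reads only its four mesh neighbors, so a block partition needs to exchange only block-boundary rows and columns.

```c
#include <stdio.h>

#define N 512

/* Each interior point is updated from its four mesh neighbors. */
void smooth(float in[N][N], float out[N][N]) {
    #pragma omp parallel for           /* rows partitioned among threads */
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            out[i][j] = 0.25f * (in[i-1][j] + in[i+1][j] +
                                 in[i][j-1] + in[i][j+1]);
}

static float a[N][N], b[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = (float)(i ^ j);  /* arbitrary test image */
    smooth(a, b);
    printf("b[1][1] = %g\n", b[1][1]);
    return 0;
}
```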

Example of Irregular Static Interaction
Sparse matrix-vector multiplication

Characteristics of Task Interactions
• What: shared data or synchronizations, and the sizes of the media
• Read-only interactions
– Tasks only read data items associated with other tasks
• Read-write interactions
– Tasks read as well as modify data items associated with other tasks
– Harder to code
• they require additional synchronization primitives
– to avoid read-write and write-write ordering races (see the sketch below)
[Figure: two tasks T1 and T2 performing read and write operations on shared data]
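
A minimal sketch of a read-write interaction guarded by a synchronization primitive, here a Pthreads mutex; the counter and the two thread roles are illustrative:

```c
#include <pthread.h>
#include <stdio.h>

static int shared_count = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *writer(void *arg) {               /* T1: read-modify-write */
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);             /* without this, the increment */
        shared_count++;                        /* races with the reader       */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *reader(void *arg) {               /* T2: a consistent read */
    (void)arg;
    pthread_mutex_lock(&lock);
    int snapshot = shared_count;
    pthread_mutex_unlock(&lock);
    printf("observed %d\n", snapshot);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer, NULL);
    pthread_create(&t2, NULL, reader, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("final %d\n", shared_count);
    return 0;
}
```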

Characteristics of Task Interactions
• How: does the interaction involve one task or both? This is the implementation concern, implicit or explicit
• One-sided
– initiated and completed independently by one of the two interacting tasks
• GET and PUT
• Two-sided
– both tasks coordinate in an interaction
• SEND + RECV (see the sketch below)
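
A sketch of both styles in MPI, assuming the program is run with two ranks. The two-sided exchange needs both tasks to participate (SEND matched by RECV); the one-sided PUT is initiated and completed by rank 0 alone, with fences marking the access epoch.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, x = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* --- two-sided: SEND + RECV, both tasks take part --- */
    if (rank == 0) { x = 17; MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD); }
    else if (rank == 1)
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* --- one-sided: PUT into a window exposed by rank 1 --- */
    int target = 0;
    MPI_Win win;
    MPI_Win_create(&target, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);
    if (rank == 0)                 /* initiated and completed by rank 0 only */
        MPI_Put(&x, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);
    if (rank == 1) printf("one-sided received %d\n", target);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```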

Today's lecture
• Decomposition Techniques (continued)
– Exploratory Decomposition
– Hybrid Decomposition
• Characteristics of Tasks and Interactions
– Task Generation, Granularity, and Context
– Characteristics of Task Interactions
• Mapping Techniques for Load Balancing
– Static and Dynamic Mapping
• Methods for Minimizing Interaction Overheads
• Parallel Algorithm Design Models

Mapping Techniques
• Parallel algorithm design so far
– the program has been decomposed
– the characteristics of tasks and interactions have been identified
• Assign the large number of concurrent tasks to an equal or relatively small number of processes for execution
– though often we do a 1:1 mapping

Mapping Techniques
• Goal of mapping: minimize overheads
– there is a cost to doing parallelism
• interactions and idling (serialization)
• Contradicting objectives: interactions vs. idling
– idling (serialization) ↑: insufficient parallelism
– interactions ↑: excessive concurrency
– E.g., assigning all work to one processor trivially minimizes interaction, at the expense of significant idling

Mapping Techniques for Minimum Idling
• Execution alternates stages of computation and interaction
• Mapping must simultaneously minimize idling and balance the load
– idling means not doing useful work
– load balance: each process doing the same amount of work
• Merely balancing load does not minimize idling
[Figure: a poor mapping wastes 50% of the time]

Mapping Techniques for Minimum Idling
Static or dynamic mapping
• Static mapping
– Tasks are mapped to processes a priori
– Needs a good estimate of task sizes
– Finding an optimal mapping may be NP-complete
• Dynamic mapping
– Tasks are mapped to processes at runtime
– Because:
• tasks are generated at runtime
• their sizes are not known in advance
• Other factors determining the choice of mapping technique
– the size of data associated with a task
– the characteristics of inter-task interactions
– even the programming models and target architectures

Schemes for Static Mapping
• Mappings based on data decomposition
– mostly a 1-1 mapping
• Mappings based on task graph partitioning
• Hybrid mappings

Mappings Based on Data Partitioning
• Partition the computation using a combination of
– data decomposition
– the "owner-computes" rule
• Example: 1-D block distribution of a 2-D dense matrix
– a 1-1 mapping of task/data to process (see the sketch below)
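
A minimal sketch, assuming a matrix-vector product y = A·x: under a 1-D row-block distribution, the owner-computes rule says the process owning row i computes y[i]. The helper names are hypothetical, and the P processes are simulated in a loop for illustration.

```c
#include <stdio.h>

#define N 8          /* matrix dimension */
#define P 4          /* number of processes */

/* Rows owned by process p under a 1-D block distribution. */
static void my_rows(int p, int *lo, int *hi) {
    int rows = N / P;               /* assumes P divides N for simplicity */
    *lo = p * rows;
    *hi = *lo + rows;
}

static void matvec_block(int p, double A[N][N], double x[N], double y[N]) {
    int lo, hi;
    my_rows(p, &lo, &hi);
    for (int i = lo; i < hi; i++) { /* owner-computes: only my rows */
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += A[i][j] * x[j];
    }
}

int main(void) {
    double A[N][N], x[N], y[N];
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        for (int j = 0; j < N; j++) A[i][j] = i + j;
    }
    for (int p = 0; p < P; p++)     /* simulate the P processes in turn */
        matvec_block(p, A, x, y);
    printf("y[0]=%g y[N-1]=%g\n", y[0], y[N-1]);
    return 0;
}
```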

Block Array Distribution Schemes
Multi-dimensional block distribution
• In general, a higher-dimensional decomposition allows the use of a larger number of processes

Block Array Distribution Schemes: Examples
Multiplying two dense matrices: A × B = C
• Partition the output matrix C using a block decomposition (see the sketch below)
– Load balance: each task computes the same number of elements of C
• Note: each element of C corresponds to a single dot product
– The choice of the precise decomposition: 1-D (row/col) or 2-D
• determined by the associated communication overhead
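
A sketch of the per-task kernel under a 2-D block decomposition of C; the matrix size and block counts are illustrative. Every element of a task's block is one full dot product over k, so all tasks do equal work.

```c
#include <stdio.h>

#define N  8     /* matrix dimension */
#define NB 2     /* blocks per dimension: NB*NB tasks, block size N/NB */

/* The task owning block (bi, bj) computes every element of that block. */
void mm_block_task(int bi, int bj,
                   double A[N][N], double B[N][N], double C[N][N]) {
    int bs = N / NB;                        /* assumes NB divides N */
    for (int i = bi * bs; i < (bi + 1) * bs; i++)
        for (int j = bj * bs; j < (bj + 1) * bs; j++) {
            double dot = 0.0;               /* one dot product per C element */
            for (int k = 0; k < N; k++)
                dot += A[i][k] * B[k][j];
            C[i][j] = dot;
        }
}

static double A[N][N], B[N][N], C[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 1.0; }
    for (int bi = 0; bi < NB; bi++)         /* each (bi, bj) is one task */
        for (int bj = 0; bj < NB; bj++)
            mm_block_task(bi, bj, A, B, C);
    printf("C[0][0] = %g (expect %d)\n", C[0][0], N);
    return 0;
}
```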

Block Distribution and Data Sharing for Dense Matrix Multiplication
[Figure: the data of A and B that each of the processes P0–P3 must access to compute its portion of C = A × B, under three partitionings]
• Row-based 1-D
• Column-based 1-D
• Row/col-based 2-D

Cyclic and Block-Cyclic Distributions
• Consider a block distribution for LU decomposition (Gaussian elimination)
– The amount of computation per data item varies
– A block decomposition would lead to significant load imbalance

LU Factorization of a Dense Matrix
[Figure: a decomposition of LU factorization into 14 tasks]

Block Distribution for LU
Notice the significant load imbalance

Block-Cyclic Distributions
• A variation of the block distribution scheme
– Partition an array into many more blocks (i.e., tasks) than the number of available processes
– Blocks are assigned to processes in a round-robin manner so that each process gets several non-adjacent blocks
– An n-to-1 mapping of tasks to processes
• Used to alleviate the load-imbalance and idling problems

Block-Cyclic Distribution for Gaussian Elimination
• The active submatrix shrinks as elimination progresses
• Assigning blocks in a block-cyclic fashion
– Each PE receives blocks from different parts of the matrix
– The PE doing the most work in one batch of the mapping will most likely receive the least in the next batch

Block-Cyclic Distribution
• A cyclic distribution: a special case with block size = 1
• A block distribution: a special case with block size = n/p
• n is the dimension of the matrix and p is the number of processes (see the sketch below)
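
A minimal sketch of the three ownership functions for a 1-D distribution of n rows over p processes; the block and cyclic cases fall out of the block-cyclic formula with b = n/p and b = 1 respectively.

```c
#include <stdio.h>

/* Which process owns row i under each distribution scheme. */
static int owner_block(int i, int n, int p)        { return i / (n / p); }
static int owner_cyclic(int i, int p)              { return i % p; }
static int owner_block_cyclic(int i, int b, int p) { return (i / b) % p; }

int main(void) {
    int n = 16, p = 4, b = 2;    /* illustrative sizes */
    for (int i = 0; i < n; i++)
        printf("row %2d: block->P%d cyclic->P%d block-cyclic(b=%d)->P%d\n",
               i, owner_block(i, n, p), owner_cyclic(i, p),
               b, owner_block_cyclic(i, b, p));
    return 0;
}
```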

Block Partitioning and Random Mapping
Sparse matrix computations
• Load imbalance using block-cyclic partitioning/mapping
– more non-zero blocks go to the diagonal processes P0, P5, P10, and P15 than to the others
– P12 gets nothing

Block Partitioning and Random Mapping

Graph Partitioning Based Data Decomposition
• Array-based partitioning and static mapping
– Regular domains, i.e., rectangular, mostly dense matrices
– Structured and regular interaction patterns
– Quite effective in balancing the computations and minimizing the interactions
• Irregular domains
– Sparse-matrix-related computations
– Numerical simulations of physical phenomena
• cars, water/blood flow, geography
• Partition the irregular domain so as to
– assign an equal number of nodes to each process
– minimize the edge count of the partition

Partitioning the Graph of Lake Superior
[Figure: random partitioning vs. partitioning for minimum edge-cut]
• Each mesh point has the same amount of computation
– easy for load balancing
• Minimize the edges crossing partitions
• Finding the optimal partition is NP-complete
– use heuristics

Mappings Based on Task Partitioning
• Schemes for static mapping
– Mappings based on data partitioning
• mostly a 1-1 mapping
– Mappings based on task graph partitioning
– Hybrid mappings
• Data partitioning
– Data decomposition, then a 1-1 mapping of tasks to PEs
• Partitioning a given task-dependency graph across processes
– Finding an optimal mapping for a general task-dependency graph is an NP-complete problem
– Excellent heuristics exist for structured graphs

Mapping a Binary Tree Dependency Graph
• Mapping the dependency graph of quicksort to processes in a hypercube
• Hypercube: the n-dimensional analogue of a square and a cube
– node numbers that differ in 1 bit are adjacent

Mapping a Sparse Graph
Sparse matrix-vector multiplication
• Using data partitioning

Mapping a Sparse Graph
Sparse matrix-vector multiplication
• Using task graph partitioning
• 13 items to communicate
– Process 0: 0, 4, 5, 8
– Process 1: 1, 2, 3, 7
– Process 2: 6, 9, 10, 11

Hierarchical/Hybrid Mappings
• A single mapping may be inadequate
– E.g., the task graph mapping of the binary tree (quicksort) cannot use a large number of processors
• Hierarchical mapping
– task graph mapping at the top level
– data partitioning within each level

Today's lecture
• Decomposition Techniques (continued)
– Exploratory Decomposition
– Hybrid Decomposition
• Characteristics of Tasks and Interactions
– Task Generation, Granularity, and Context
– Characteristics of Task Interactions
• Mapping Techniques for Load Balancing
– Static and Dynamic Mapping
• Methods for Minimizing Interaction Overheads
• Parallel Algorithm Design Models

Schemes for Dynamic Mapping
• Also referred to as dynamic load balancing
– Load balancing is the primary motivation for dynamic mapping
• Dynamic mapping schemes can be
– centralized
– distributed

Centralized Dynamic Mapping
• Processes are designated as masters or slaves
– Workers ("slave" is politically incorrect)
• General strategy
– The master holds the pool of tasks and acts as the central dispatcher
– When a worker runs out of work, it requests more work from the master
• Challenge
– As the number of processes increases, the master may become the bottleneck
• Approach
– Chunk scheduling: a process picks up multiple tasks at once (see the sketch after this list)
– Chunk size:
• large chunk sizes may lead to significant load imbalances as well
• schemes gradually decrease the chunk size as the computation progresses
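
A minimal sketch of centralized chunk scheduling with a guided-style shrinking chunk size; a shared counter behind a critical section stands in for the master's task pool.

```c
#include <stdio.h>
#include <omp.h>

#define NTASKS 1000

int main(void) {
    int next = 0;                       /* master's pool: next unassigned task */
    #pragma omp parallel
    {
        for (;;) {
            int lo, chunk;
            #pragma omp critical        /* the "request work from master" step */
            {
                int remaining = NTASKS - next;
                /* guided-style: grab a fraction of what is left */
                chunk = remaining / (2 * omp_get_num_threads());
                if (chunk < 1) chunk = 1;          /* final small chunks */
                if (chunk > remaining) chunk = remaining;
                lo = next;
                next += chunk;
            }
            if (chunk == 0) break;      /* pool empty: stop requesting */
            for (int t = lo; t < lo + chunk; t++)
                ;                       /* process task t here */
        }
    }
    printf("dispatched %d tasks\n", next);
    return 0;
}
```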

Distributed Dynamic Mapping
• All processes are created equal
– Each can send work to, or receive work from, any other
• Alleviates the bottleneck of centralized schemes
• Four critical design questions:
– how are sending and receiving processes paired together?
– who initiates the work transfer?
– how much work is transferred?
– when is a transfer triggered?
• The answers are generally application specific
• Work stealing (see the sketch below)
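
A minimal work-stealing sketch: each worker drains its own queue first, then steals from a victim. For simplicity it steals one task at a time under a lock; production runtimes typically steal larger portions through lock-free deques.

```c
#include <stdio.h>
#include <omp.h>

#define NW 4
#define NTASKS 1000

typedef struct { int lo, hi; omp_lock_t lock; } queue_t;
static queue_t q[NW];

static int get_task(int me, int *task) {
    omp_set_lock(&q[me].lock);               /* first try local work */
    if (q[me].lo < q[me].hi) {
        *task = q[me].lo++;
        omp_unset_lock(&q[me].lock);
        return 1;
    }
    omp_unset_lock(&q[me].lock);
    for (int v = 0; v < NW; v++) {           /* then try to steal */
        if (v == me) continue;
        omp_set_lock(&q[v].lock);
        if (q[v].lo < q[v].hi) {             /* take from the victim's tail */
            *task = --q[v].hi;
            omp_unset_lock(&q[v].lock);
            return 1;
        }
        omp_unset_lock(&q[v].lock);
    }
    return 0;                                /* everyone is out of work */
}

int main(void) {
    int done = 0;
    for (int w = 0; w < NW; w++) {
        q[w].lo = 0;
        q[w].hi = (w == 0) ? NTASKS : 0;     /* deliberately uneven: worker 0
                                                starts with all the work */
        omp_init_lock(&q[w].lock);
    }
    #pragma omp parallel num_threads(NW) reduction(+:done)
    {
        int me = omp_get_thread_num(), t;
        while (get_task(me, &t))
            done++;                          /* process task t here */
    }
    printf("processed %d tasks\n", done);
    return 0;
}
```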

Today's lecture
• Decomposition Techniques (continued)
– Exploratory Decomposition
– Hybrid Decomposition
• Characteristics of Tasks and Interactions
– Task Generation, Granularity, and Context
– Characteristics of Task Interactions
• Mapping Techniques for Load Balancing
– Static and Dynamic Mapping
• Methods for Minimizing Interaction Overheads
• Parallel Algorithm Design Models

Minimizing Interaction Overheads
Rules of thumb
• Maximize data locality
– Where possible, reuse intermediate data
– Restructure the computation so that data can be reused in smaller time windows
• Minimize the volume of data exchange
– partition the interaction graph to minimize edge crossings
• Minimize the frequency of interactions
– merge multiple interactions into one, e.g., aggregate small messages
• Minimize contention and hot spots
– use decentralized techniques
– replicate data where necessary

Minimizing Interaction Overheads (continued)
Techniques
• Overlapping computations with interactions (see the sketch after this list)
– Use non-blocking communications
– Multithreading
– Prefetching to hide latencies
• Replicating data or computations to reduce communication
• Using group communications instead of point-to-point primitives
• Overlapping interactions with other interactions
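
A minimal sketch of the first technique, assuming two MPI ranks that exchange arrays: each rank starts the exchange with non-blocking calls, computes on data it already owns, and waits only when the remote data is actually needed.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    static double mine[N], theirs[N];
    int rank, peer;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;                       /* run with exactly 2 ranks */
    for (int i = 0; i < N; i++) mine[i] = rank + 1.0;

    /* start the exchange, but do not block on it */
    MPI_Irecv(theirs, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(mine,   N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    double local = 0.0;                    /* useful work needing no remote data */
    for (int i = 0; i < N; i++) local += mine[i] * mine[i];

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* remote data needed now */
    double remote = 0.0;
    for (int i = 0; i < N; i++) remote += theirs[i];

    printf("rank %d: local=%g remote=%g\n", rank, local, remote);
    MPI_Finalize();
    return 0;
}
```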

Today's lecture
• Decomposition Techniques (continued)
– Exploratory Decomposition
– Hybrid Decomposition
• Characteristics of Tasks and Interactions
– Task Generation, Granularity, and Context
– Characteristics of Task Interactions
• Mapping Techniques for Load Balancing
– Static and Dynamic Mapping
• Methods for Minimizing Interaction Overheads
• Parallel Algorithm Design Models

Parallel Algorithm Models
• Ways of structuring a parallel algorithm
– the decomposition technique
– the mapping technique
– the strategy to minimize interactions
• Data-Parallel Model
– Each task performs similar operations on different data
– Tasks are statically (or semi-statically) mapped to processes
• Task Graph Model
– Use the task dependency graph to guide the model toward better locality or lower interaction costs

Parallel Algorithm Models (continued)
• Master-Slave Model
– The master (one or more) generates work
– and dispatches it to workers
– Dispatching may be static or dynamic
• Pipeline / Producer-Consumer Model
– A stream of data is passed through a succession of processes, each of which performs some task on it
– Multiple streams run concurrently
• Hybrid Models
– Applying multiple models hierarchically
– Applying multiple models sequentially to different phases of a parallel algorithm

References
• Adapted from the slides "Principles of Parallel Algorithm Design" by Ananth Grama
• Based on Chapter 3 of "Introduction to Parallel Computing" by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Addison Wesley, 2003