CSE 412 / CS 454 / MATH 486: Parallel Numerical Algorithms
5. Interprocessor Communication
Prof. Michael T. Heath
Department of Computer Science
University of Illinois at Urbana-Champaign
Copyright © 2004, Michael T. Heath
Message Routing

If a message is sent between processors that are not directly connected, then it must be routed through intermediate processors.

Message routing algorithms can be
- minimal or nonminimal
- static or dynamic
- deterministic or randomized
- circuit switched or packet switched

Most regular network topologies admit relatively simple routing schemes that are static, deterministic, and minimal.
Example: Routing in Mesh

In a 2-D mesh, a message is forwarded along the row (or column) of the sending node until the column (or row) of the destination node is reached, then forwarded along the destination column (or row) until the destination node is reached.

In a 3-D mesh, forwarding takes place similarly along each dimension until the destination node is reached.
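The dimension-order rule above can be sketched as a short Python simulation (an illustrative sketch, not from the slides; the function name and (row, col) coordinate convention are assumptions):

```python
def mesh_route(src, dst):
    """Dimension-order routing in a 2-D mesh.

    src and dst are (row, col) coordinates; returns the list of nodes
    visited, moving first along the sending node's row until the
    destination column is reached, then along that column.
    """
    path = [src]
    r, c = src
    # correct the column index first (move along the source's row)
    step = 1 if dst[1] > c else -1
    while c != dst[1]:
        c += step
        path.append((r, c))
    # then correct the row index (move along the destination column)
    step = 1 if dst[0] > r else -1
    while r != dst[0]:
        r += step
        path.append((r, c))
    return path
```

The route is minimal: one hop per unit of Manhattan distance between source and destination.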
Example: Routing in Hypercube

In a hypercube, if the current node number differs from that of the destination node in the kth bit, then the message is forwarded to the adjacent node with the opposite value in the kth bit.

The message reaches the destination node in d steps, where d is the number of bit positions in which the source and destination node numbers differ, which is at most log p.
[Figure: 3-cube with nodes labeled 000 through 111 in binary]
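A minimal Python sketch of this bit-correction rule (illustrative; correcting bits from lowest to highest is only one of several valid orders, as noted on a later slide):

```python
def hypercube_route(src, dst):
    """Route in a hypercube by correcting differing bits low-to-high.

    Node numbers are integers; each step flips one bit in which the
    current node still differs from the destination, so the number of
    hops equals the Hamming distance between src and dst.
    """
    path = [src]
    cur = src
    bit = 1
    while cur != dst:
        if (cur ^ dst) & bit:   # current and destination differ in this bit
            cur ^= bit          # forward to neighbor with opposite bit value
            path.append(cur)
        bit <<= 1
    return path
```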
Message Routing

There is often considerable freedom in choosing a routing scheme:
- In a 2-D or 3-D mesh, one can traverse the respective dimensions in any order
- In a hypercube, bits that differ between source and destination nodes can be "corrected" in any order

Thus, there are often multiple possible paths for any given message, and this freedom can sometimes be exploited for improved performance or fault tolerance.
Cut-Through Routing

Early distributed-memory multicomputers used store-and-forward routing: at each node along the path from source to destination, the entire message is received and stored before being forwarded.

Modern communication networks use cut-through (or wormhole) routing, in which the message is broken into smaller segments that are pipelined through the network.

Each node on the path forwards each segment as soon as it is received, which improves performance and reduces buffer space requirements.
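The performance difference can be illustrated with a simple cost model (the symbols alpha for startup time and beta for per-word transfer time, the segment-size parameter, and the single-startup assumption for cut-through are modeling assumptions, not from the slides):

```python
def store_and_forward(D, n, alpha, beta):
    """Time for an n-word message over D hops when the entire
    message is received and stored at every intermediate node."""
    return D * (alpha + beta * n)

def cut_through(D, n, alpha, beta, seg):
    """Time when the message is pipelined in seg-word segments:
    the first segment crosses all D links, and each remaining
    segment follows one segment-time behind."""
    nseg = -(-n // seg)                       # ceil(n / seg)
    return alpha + D * beta * seg + (nseg - 1) * beta * seg
```

For long messages over many hops, cut-through time approaches alpha + beta*n regardless of distance, which is why network distance matters less, as a later slide notes.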
Store-and-Forward vs Cut-Thr ough
PPPP
0
1
2
3
time
PPPP
0
1
2
3
time
store-and-forward
cut-through
Cut-Through Routing

In effect, cut-through routing establishes a virtual circuit between the source and destination nodes.

Care must be taken in designing the routing algorithm to avoid potential deadlock when multiple messages contend for the same link.

Cut-through routing makes network distance less important for individual messages, so matching the problem topology to the network topology is less crucial.

Aggregate bandwidth constraints still necessitate some attention to locality, however.
Communication Concurrency

We have thus far considered only point-to-point communication, in which one pair of processors communicate with each other.

If many processors communicate simultaneously, overall performance is affected by the degree of concurrency supported by the communication system.

It may or may not be possible for a processor to
- send and receive on the same link simultaneously
- send on one link and receive on another link simultaneously
- send and/or receive on multiple links simultaneously
Communication Concurrency

We can usually allow for these distinctions by appropriately defining what we mean by a "step" of a given communication pattern.

The effect is to multiply the overall cost by a constant factor in a network whose degree does not vary with the number of processors.

The corresponding factor may grow with the number of processors in a network having variable degree.
Collective Communication

Collective communication involves multiple nodes simultaneously.

Examples occurring frequently include
- broadcast: one-to-all
- reduction: all-to-one
- multinode broadcast: all-to-all
- scatter/gather: one-to-all/all-to-one
- total exchange: personalized all-to-all
- scan or prefix
- circular shift
- barrier
Broadcast

In broadcast, a source node communicates a single message to the p − 1 other nodes.

The source node could send p − 1 separate messages serially, one to each of the other nodes.

Efficiency can be improved by exploiting parallelism and the fact that messages often need to be routed through intermediate nodes anyway.
Generic Broadcast Algorithm

1. If source ≠ me, receive message
2. Send message to each of my direct neighbors who have not already received it
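A Python simulation of the generic algorithm, counting a "step" as one round in which every node holding the message forwards it to all unreached neighbors (an illustrative sketch; names and the adjacency-dict representation are assumptions):

```python
from collections import deque

def broadcast_steps(adj, source):
    """Simulate the generic broadcast algorithm by breadth-first
    search from the source.

    adj maps each node to its list of direct neighbors; returns
    {node: step at which it receives the message}.  The implied
    spanning tree is rooted at the source, and the largest step
    equals its height.
    """
    step = {source: 0}
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        # forward only to neighbors that have not already received it
        for nbr in adj[node]:
            if nbr not in step:
                step[nbr] = step[node] + 1
                frontier.append(nbr)
    return step
```

Running this on a 3-cube (neighbors obtained by flipping each of the 3 address bits) reaches all 8 nodes in 3 steps, matching the log p height discussed on the following slides.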
Broadcast in Mesh or Torus

[Figure: broadcast spanning trees in a 1-D mesh, a 1-D torus (ring), and a 2-D mesh, with edges labeled by the step (1–4) at which each message is sent]
Broadcast in Hypercube

[Figure: broadcast spanning trees in a 2-cube, 3-cube, and 4-cube]
Cost of Broadcast

The broadcast algorithm generates a spanning tree for the given network, with the source node as root.

The height of the spanning tree determines the total number of steps required.

The cost of broadcast for a message of length n is
- 1-D mesh: (p − 1)(α + β n)
- 2-D mesh: 2(√p − 1)(α + β n)
- Hypercube: (log p)(α + β n)
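These formulas can be sketched directly (an illustrative helper; alpha denotes the startup time and beta the per-word transfer time, and the function and parameter names are assumptions, not from the slides):

```python
from math import log2, sqrt

def broadcast_cost(topology, p, n, alpha, beta):
    """Broadcast cost model from the slide: the spanning-tree height
    for the given topology times the per-message transfer time."""
    height = {
        "1-D mesh": p - 1,
        "2-D mesh": 2 * (sqrt(p) - 1),
        "hypercube": log2(p),
    }[topology]
    return height * (alpha + beta * n)
```

For p = 64 the tree heights are 63, 14, and 6 respectively, so the hypercube's logarithmic height dominates for large p.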
Enhanced Broadcast

For a long message for which bandwidth dominates latency, network bandwidth may be better exploited by breaking the message into pieces and either
- pipelining the pieces along a single spanning tree, or
- sending each piece along a different spanning tree with the same root

In a hypercube with p = 2^d nodes, with any given node as root, there are d edge-disjoint spanning trees, all of which can potentially be exploited simultaneously in broadcast.
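A simple cost model for the pipelined variant (illustrative only; it assumes one startup per piece and ignores contention, and all names are assumptions):

```python
def pipelined_broadcast_cost(height, n, s, alpha, beta):
    """Pipelined broadcast of an n-word message in pieces of s words
    along a spanning tree of the given height: the first piece takes
    height steps to reach the deepest leaf, and the remaining pieces
    follow one step behind each other."""
    npieces = -(-n // s)             # ceil(n / s)
    step = alpha + beta * s          # time to forward one piece one hop
    return (height + npieces - 1) * step
```

With a single piece (s = n) this reduces to the unpipelined cost height*(alpha + beta*n); with many pieces the height term is amortized, which is why pipelining pays off when bandwidth dominates latency.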
Reduction

In reduction, data from all p nodes are combined by applying a specified associative operation (e.g., sum, product, max, min, logical OR, logical AND) to produce an overall result.

As with broadcast, reduction uses a spanning tree for the given network, but the data flow is in the opposite direction, from the leaves to the root.

Incoming results are combined with the receiving node's value before forwarding to its parent.

The final result ends up at the root node; if it is also needed by other nodes, the final result can be broadcast.
Generic Reduction Algorithm

1. Receive message from each of my children in spanning tree, if any
2. Combine received values with my own using specified associative operation
3. Send result to my parent, if any
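The three steps above can be simulated with a short recursion over the spanning tree (an illustrative sketch; the children-dict representation and names are assumptions):

```python
def reduce_tree(children, values, node, op):
    """Generic reduction: combine each child's subtree result with
    this node's own value, then pass the combined result upward.

    children maps node -> list of children in the spanning tree,
    values maps node -> local value; call with node = root to get
    the overall result at the root."""
    result = values[node]
    for child in children.get(node, []):
        # receive the child's subtree result and combine it with ours
        result = op(result, reduce_tree(children, values, child, op))
    return result
```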
Reduction in Mesh

[Figure: reduction trees in a 1-D mesh, a 1-D torus (ring), and a 2-D mesh, with edges labeled by the step (1–4) at which each message is sent]
Reduction in Hypercube

[Figure: reduction trees in a 2-cube, 3-cube, and 4-cube]
Cost of Reduction

The reduction algorithm uses the same spanning tree as broadcast, but messages flow in the reverse direction.

The height of the spanning tree determines the total number of steps required.

The cost of reduction for a message of length n is
- 1-D mesh: (p − 1)(α + (β + γ) n)
- 2-D mesh: 2(√p − 1)(α + (β + γ) n)
- Hypercube: (log p)(α + (β + γ) n)

where γ is the cost per word of the associative reduction operation.
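The same tree heights as broadcast, with gamma added to the per-word cost, give the reduction cost (an illustrative helper mirroring the broadcast one; all names are assumptions):

```python
from math import log2, sqrt

def reduction_cost(topology, p, n, alpha, beta, gamma):
    """Reduction cost model from the slide: same spanning-tree heights
    as broadcast, with gamma, the per-word cost of the associative
    operation, added to beta, the per-word transfer cost."""
    height = {
        "1-D mesh": p - 1,
        "2-D mesh": 2 * (sqrt(p) - 1),
        "hypercube": log2(p),
    }[topology]
    return height * (alpha + (beta + gamma) * n)
```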
Multinode Broadcast

In multinode broadcast, each node sends a message to all other nodes.

This all-to-all operation is logically equivalent to p broadcasts, one from each node, and could be implemented that way.

Efficiency can often be improved by overlapping the separate broadcasts.

The total cost of multinode broadcast depends strongly on the degree of overlap supported by the target system.

Multinode broadcast need be no more costly than standard broadcast if aggressive overlapping of communication is supported.
Multinode Broadcast in Torus

In a 1-D torus, broadcast can be initiated from each node simultaneously in the same direction around the ring.

After p − 1 steps, each node has received data from all other nodes, the multinode broadcast is complete, and the cost is the same as a standard broadcast.

In a 2-D torus, the ring algorithm can be applied first in each row, then in each column (or vice versa).

There are 2(√p − 1) steps for a square 2-D torus.

Messages for the second phase are larger by a factor of √p, so the total amount of data transferred is still proportional to p.
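The ring phase can be simulated in a few lines (an illustrative sketch of the 1-D torus case; the name and list-of-blocks representation are assumptions):

```python
def ring_allgather(data):
    """Multinode broadcast on a 1-D torus: in each of p - 1 steps,
    every node passes the block it received in the previous step to
    its right neighbor, so each node's own block travels all the way
    around the ring.

    data[i] is node i's block; returns a list where entry i is the
    list of all blocks node i has accumulated."""
    p = len(data)
    have = [[data[i]] for i in range(p)]
    send = list(data)                 # block each node forwards next
    for _ in range(p - 1):
        recv = [send[(i - 1) % p] for i in range(p)]   # shift right
        for i in range(p):
            have[i].append(recv[i])
        send = recv
    return have
```

After p − 1 steps every node holds all p blocks, matching the slide's claim that the cost equals that of a standard ring broadcast.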
Multinode Broadcast in Hypercube

In a hypercube with p = 2^d nodes, multinode broadcast can be implemented by successive pairwise exchanges in each of the d dimensions, with messages concatenated at each stage.

There are log p steps for a hypercube, but the growth in message sizes means that the total communication volume is still proportional to p.
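The pairwise-exchange scheme can be sketched as follows (an illustrative simulation; node i's partner in each dimension is i with one address bit flipped, and the names are assumptions):

```python
def hypercube_allgather(data):
    """Multinode broadcast on a hypercube with p = 2**d nodes:
    d = log p rounds of pairwise exchanges, one per dimension,
    concatenating the accumulated blocks at each stage.

    data[i] is node i's block; returns per-node lists of all blocks."""
    p = len(data)                     # assumed to be a power of two
    have = [[x] for x in data]
    bit = 1
    while bit < p:
        # every node swaps its accumulated blocks with its partner
        # across this dimension, doubling what each node holds
        have = [have[i] + have[i ^ bit] for i in range(p)]
        bit <<= 1
    return have
```

Message sizes double each round (1, 2, 4, ..., p/2 blocks exchanged), so the total volume sums to p − 1 blocks per node, proportional to p as the slide states.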
Reduction via Multinode Broadcast

If instead of concatenating the messages they are combined using a specified associative operation, multinode broadcast can be used to implement reduction.

Since all nodes receive the final result, this approach avoids the root node having to broadcast it after the reduction, thereby saving a factor of up to two in cost if the result is needed by all nodes.
Personalized Collective Communication

In broadcast or multinode broadcast, a given node sends the same message to all other nodes.

In the analogous personalized versions, a distinct message is sent to each other node:
- Scatter: analogous to broadcast, but the root sends a distinct message to each other node
- Gather: analogous to reduction, but the data received by the root are concatenated rather than combined using an associative operation
- Total exchange: analogous to multinode broadcast, but each node exchanges a distinct message with each other node
Personalized Collective Communication

Scatter uses a spanning tree algorithm similar to standard broadcast, but multiple messages are transmitted together at each stage:
- The root node sends messages to each of its children containing the data for the entire subtree of which that child is the root
- Each child retains its own portion of the data and forwards the appropriate subsets of the remainder to each of its children
- Eventually every node receives its distinct message
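The subtree-splitting idea can be simulated concretely; the sketch below specializes it to a hypercube with node 0 as root, where at each stage a node forwards half of its remaining block across one dimension (an illustrative sketch; the names and dict representation are assumptions):

```python
def hypercube_scatter(messages):
    """Scatter on a hypercube with p = 2**d nodes, rooted at node 0:
    at each stage, every node holding more than one message sends the
    half destined for the other subcube across the highest remaining
    dimension and keeps the rest.

    messages[i] is the distinct message destined for node i; returns
    {node: its message}."""
    p = len(messages)               # assumed to be a power of two
    held = {0: list(messages)}      # node -> block of messages it holds
    bit = p >> 1
    while bit:
        for node in list(held):
            block = held[node]
            if len(block) > 1:
                half = len(block) // 2
                # forward the upper half of the block to the child
                # across this dimension; keep the lower half
                held[node ^ bit] = block[half:]
                held[node] = block[:half]
        bit >>= 1
    return {node: block[0] for node, block in held.items()}
```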
Personalized Collective Communication

Gather uses algorithms similar to reduction, except that the data are concatenated at each stage rather than combined using an associative operation.

Total exchange uses an algorithm similar to multinode broadcast, except that the broadcasts are replaced by scatter operations, which are overlapped as in multinode broadcast.
Scan or Prefix

In a scan or prefix operation, data values x_0, x_1, ..., x_{p−1} are given, one per node, along with a specified associative operation ⊕.

The sequence of partial results s_0, s_1, ..., s_{p−1} is to be computed, where

s_k = x_0 ⊕ x_1 ⊕ ⋯ ⊕ x_k,

and s_k is to reside on node k, for k = 0, ..., p − 1.
Scan or Prefix

The scan operation can be implemented by algorithms similar to those for multinode broadcast, except that the intermediate results received by each node are selectively combined, depending on the sending node's number, before forwarding.
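One standard realization of this idea, specialized to a hypercube, is sketched below (illustrative; each node carries a running prefix and a running segment total, and folds an incoming total into its prefix only when the sender's node number is lower — the "selective combining" mentioned above):

```python
def hypercube_scan(x, op):
    """Inclusive scan on a hypercube with p = 2**d nodes, simulated
    centrally: log p rounds of pairwise exchanges of segment totals,
    one per dimension.

    x[i] is node i's value; returns [s_0, ..., s_{p-1}] with
    s_k = x_0 op x_1 op ... op x_k."""
    p = len(x)                       # assumed to be a power of two
    prefix = list(x)                 # running prefix on each node
    total = list(x)                  # running segment total on each node
    bit = 1
    while bit < p:
        incoming = [total[i ^ bit] for i in range(p)]   # simultaneous swap
        for i in range(p):
            if i & bit:
                # partner has a lower node number: its segment precedes
                # ours, so fold its total into our prefix
                prefix[i] = op(incoming[i], prefix[i])
                total[i] = op(incoming[i], total[i])
            else:
                total[i] = op(total[i], incoming[i])
        bit <<= 1
    return prefix
```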
Circular Shift

In a circular k-shift, with 0 < k < p, node i sends data to node (i + k) mod p.

Such operations arise in some finite difference and matrix computations and in string matching problems.

A circular shift can be implemented quite naturally in a ring network.

Implementing a circular shift in other networks can be considerably more complicated, but basically it involves embedding a ring or a series of rings in the given network.
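The data movement itself is easy to state (an illustrative one-liner; the name is an assumption):

```python
def circular_shift(data, k):
    """Circular k-shift: node i sends its data to node (i + k) mod p,
    so the value ending up on node j came from node (j - k) mod p."""
    p = len(data)
    assert 0 < k < p
    return [data[(j - k) % p] for j in range(p)]
```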
Barrier

Barrier: a synchronization mechanism in which all processors must reach the barrier before any processor is allowed to proceed beyond it.

The implementation of a barrier depends on the underlying memory architecture and network.

In distributed-memory systems, a barrier is usually implemented by message passing, using algorithms similar to those for all-to-all communication.

In shared-memory systems, a barrier is usually implemented using test-and-set, a semaphore, or another mechanism for enforcing mutual exclusion.