
CSE 412 / CS 454 / MATH 486
Parallel Numerical Algorithms

5. Interprocessor Communication

Prof. Michael T. Heath

Department of Computer Science

University of Illinois at Urbana-Champaign

Copyright © 2004, Michael T. Heath

Message Routing

If message is sent between processors that are not directly connected, then it must be routed through intermediate processors

Message routing algorithms can be
minimal or nonminimal
static or dynamic
deterministic or randomized
circuit switched or packet switched

Most regular network topologies admit relatively simple routing schemes that are static, deterministic, and minimal


Example: Routing in Mesh

In 2-D mesh, message is forwarded along row (or column) of sending node until column (or row) of destination node is reached, then forwarded along destination column (or row) until destination node is reached

In 3-D mesh, forwarding takes place similarly along each dimension until destination node is reached
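The 2-D scheme above (often called dimension-order or XY routing) can be sketched as follows; `mesh_route` is a hypothetical helper, used here only to trace the path a message would take:

```python
def mesh_route(src, dst):
    """Dimension-order (XY) routing in a 2-D mesh: forward along the
    row dimension until the destination row is reached, then along the
    column dimension. Nodes are (row, col) pairs."""
    path = [src]
    r, c = src
    dr, dc = dst
    # forward along sending node's column until destination row is reached
    while r != dr:
        r += 1 if dr > r else -1
        path.append((r, c))
    # then forward along destination row until destination node is reached
    while c != dc:
        c += 1 if dc > c else -1
        path.append((r, c))
    return path
```

The route is minimal: the number of hops equals the sum of the coordinate distances.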


Example: Routing in Hypercube

In hypercube, if current node number differs from that of destination node in kth bit, then message is forwarded to adjacent node with opposite value in kth bit

Message reaches destination node in d steps, where d is number of bit positions in which source and destination node numbers differ, which is at most log p

[Figure: 3-cube with nodes labeled 000 through 111, illustrating bit-correction routing]
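The bit-correction rule can be sketched in a few lines; here the differing bits are corrected in ascending order, which is one of several valid choices (the function name is illustrative):

```python
def hypercube_route(src, dst):
    """Route in a hypercube by correcting differing bits: at each step,
    move to the adjacent node obtained by flipping the lowest bit in
    which the current and destination node numbers differ."""
    path = [src]
    cur = src
    while cur != dst:
        diff = cur ^ dst          # bit positions still to be corrected
        lowest = diff & -diff     # lowest differing bit
        cur ^= lowest             # forward to adjacent node across that bit
        path.append(cur)
    return path
```

The number of hops equals the number of differing bits, matching the d-step bound above.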


Message Routing

There is often considerable freedom in choosing routing scheme

In 2-D or 3-D mesh, one can take respective dimensions in any order
In hypercube, bits that differ between source and destination nodes can be “corrected” in any order

Thus, there are often multiple possible paths for any given message, and this freedom can sometimes be exploited for improved performance or fault tolerance


Cut-Through Routing

Early distributed-memory multicomputers used store-and-forward routing: at each node along path from source to destination, entire message is received and stored before being forwarded

Modern communication networks use cut-through (or wormhole) routing, in which message is broken into smaller segments that are pipelined through network

Each node on path forwards each segment as soon as it is received, which improves performance and reduces buffer space requirements
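The benefit can be made concrete with a simple cost model; the symbols below are assumptions for illustration, not definitions from these slides (α startup time, β transfer time per word, θ per-hop switching delay). For a message of n words crossing l links,

```latex
T_{\text{store-and-forward}} = l\,(\alpha + \beta n),
\qquad
T_{\text{cut-through}} \approx \alpha + l\,\theta + \beta n
```

With cut-through routing the distance l no longer multiplies the whole transmission time βn, only the small per-hop term, which is why pipelining the segments helps.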


Store-and-Forward vs Cut-Thr ough

PPPP

0

1

2

3

time

PPPP

0

1

2

3

time

store-and-forward

cut-through


Cut-Through Routing

In effect, cut-through routing establishes virtual circuit between source and destination nodes

Care must be taken in designing routing algorithm to avoid potential deadlock when multiple messages contend for same link

Cut-through routing makes network distance less important for individual messages, so matching problem topology to network topology is less crucial

Aggregate bandwidth constraints still necessitate some attention to locality, however


Communication Concurrency

We have thus far considered only point-to-point communication, in which one pair of processors communicate with each other

If many processors communicate simultaneously, overall performance is affected by degree of concurrency supported by communication system

It may or may not be possible for processor to
send and receive on same link simultaneously
send on one link and receive on another link simultaneously
send and/or receive on multiple links simultaneously


Communication Concurrency

We can usually allow for these distinctions by appropriately defining what we mean by “step” of given communication pattern

Effect is to multiply overall cost by constant factor in network whose degree does not vary with number of processors

Corresponding factor may grow with number of processors in network having variable degree


Collective Communication

Collective communication involves multiple nodes simultaneously

Examples occurring frequently include
broadcast: one-to-all
reduction: all-to-one
multinode broadcast: all-to-all
scatter/gather: one-to-all / all-to-one
total exchange: personalized all-to-all
scan or prefix
circular shift
barrier


Broadcast

In broadcast, source node communicates single message to p − 1 other nodes

Source node could send p − 1 separate messages serially, one to each of other nodes

Efficiency can be improved by exploiting parallelism and fact that messages often need to be routed through intermediate nodes anyway


Generic Broadcast Algorithm

1. If source ≠ me, receive message

2. Send message to each of my direct neighbors who have not already received it
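On a hypercube, the generic algorithm specializes to the familiar binomial-tree broadcast. The sketch below is a simulation, not message-passing code: it records the step at which each node first receives the message, traversing dimensions in ascending order (one arbitrary choice):

```python
def hypercube_broadcast(root, d):
    """Simulate broadcast in a d-cube: in step k, every node that
    already holds the message forwards it to its neighbor across
    dimension k. Returns the step at which each node received the
    message (0 for the root)."""
    recv_step = {root: 0}
    for k in range(d):
        # snapshot: only nodes that held the message before this step send
        for node in list(recv_step):
            partner = node ^ (1 << k)
            if partner not in recv_step:
                recv_step[partner] = k + 1
    return recv_step
```

All p = 2^d nodes are reached in d = log p steps, matching the spanning-tree height.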


Broadcast in Mesh or Torus

[Figure: broadcast spanning trees with edges labeled by step number, for a 2-D mesh, a 1-D mesh, and a 1-D torus (ring)]


Broadcast in Hypercube

[Figure: broadcast spanning trees for a 2-cube, a 3-cube, and a 4-cube]


Cost of Broadcast

Broadcast algorithm generates spanning tree for given network, with source node as root

Height of spanning tree determines total number of steps required

Cost of broadcast for message of length n, where α is message startup time and β is transfer time per word, is
1-D mesh: (p − 1)(α + βn)
2-D mesh: 2(√p − 1)(α + βn)
Hypercube: (α + βn) log p


Enhanced Broadcast

For long message for which bandwidth dominates latency, network bandwidth may be better exploited by breaking message into pieces and either
pipelining pieces along single spanning tree, or
sending each piece along different spanning tree with same root

In hypercube with p = 2^d nodes, with any given node as root, there are d edge-disjoint spanning trees, all of which can potentially be exploited simultaneously in broadcast


Reduction

In reduction, data from all p nodes are combined by applying specified associative operation (e.g., sum, product, max, min, logical OR, logical AND) to produce overall result

As with broadcast, reduction uses spanning tree for given network, but data flow is in opposite direction, from leaves to root

Incoming results are combined with receiving node’s value before forwarding to its parent

Final result ends up at root node; if it is also needed by other nodes, final result can be broadcast


Generic Reduction Algorithm

1. Receive message from each of my children in spanning tree, if any

2. Combine received values with my own using specified associative operation

3. Send result to my parent, if any
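A simulation of this algorithm on a hypercube, using the binomial tree rooted at node 0 as the spanning tree (an illustrative sketch, assuming p is a power of two):

```python
def hypercube_reduce(values, op):
    """Simulate reduction to node 0 in a hypercube: in each dimension,
    the node with the higher number sends its partial result to its
    partner, which combines the received value with its own."""
    p = len(values)               # assume p is a power of two
    d = p.bit_length() - 1
    partial = list(values)
    for k in reversed(range(d)):
        for node in range(1 << k):
            partner = node | (1 << k)     # child across dimension k
            partial[node] = op(partial[node], partial[partner])
    return partial[0]             # final result resides at root, node 0
```

After d = log p steps the root holds the combination of all p values.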


Reduction in Mesh

[Figure: reduction trees with edges labeled by step number, for a 1-D torus (ring), a 2-D mesh, and a 1-D mesh]


Reduction in Hypercube

[Figure: reduction trees for a 2-cube, a 3-cube, and a 4-cube]


Cost of Reduction

Reduction algorithm uses same spanning tree as broadcast, but messages flow in reverse direction

Height of spanning tree determines total number of steps required

Cost of reduction for message of length n is
1-D mesh: (p − 1)(α + (β + γ)n)
2-D mesh: 2(√p − 1)(α + (β + γ)n)
Hypercube: (α + (β + γ)n) log p

where γ is cost per word of associative reduction operation


Multinode Broadcast

In multinode broadcast, each node sends message to all other nodes

This all-to-all operation is logically equivalent to p broadcasts, one from each node, and could be implemented that way

Efficiency can often be improved by overlapping separate broadcasts

Total cost of multinode broadcast depends strongly on degree of overlap supported by target system

Multinode broadcast need be no more costly than standard broadcast if aggressive overlapping of communication is supported


Multinode Broadcast in Torus

In 1-D torus, broadcast can be initiated from each node simultaneously in same direction around ring

After p − 1 steps, each node has received data from all other nodes, multinode broadcast is complete, and cost is same as standard broadcast

In 2-D torus, ring algorithm can be applied first in each row, then in each column (or vice versa)

There are 2(√p − 1) steps for square 2-D torus

Messages for second phase are larger by factor of √p, so total amount of data transferred is still proportional to p
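The 1-D torus algorithm is easy to simulate: each of the p − 1 steps passes the most recently received block one position around the ring (an illustrative sketch, not message-passing code):

```python
def ring_allgather(data):
    """Simulate multinode broadcast on a 1-D torus (ring) with one block
    per node: for p - 1 steps, every node forwards the block it received
    last to its right neighbor. Afterwards each node holds all p blocks."""
    p = len(data)
    have = [[data[i]] for i in range(p)]   # blocks held by each node
    sending = list(data)                   # block each node sends next
    for _ in range(p - 1):
        # each node receives from its left neighbor, simultaneously
        received = [sending[(i - 1) % p] for i in range(p)]
        for i in range(p):
            have[i].append(received[i])
        sending = received                 # forward what was just received
    return have
```

Every link is busy at every step, which is why the p − 1 overlapped broadcasts cost no more than one.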


Multinode Broadcast in Hypercube

In hypercube with p = 2^d nodes, multinode broadcast can be implemented by successive pairwise exchanges in each of d dimensions, with messages concatenated at each stage

There are log p = d steps for hypercube, but growth in message sizes means that total communication volume is still proportional to p
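The pairwise-exchange scheme can be simulated directly; after the kth step, each node holds the data of its entire k-dimensional subcube (a sketch assuming p is a power of two):

```python
def hypercube_allgather(data):
    """Simulate multinode broadcast in a hypercube with p = 2**d nodes:
    d pairwise-exchange steps, one per dimension, with the accumulated
    messages concatenated at each stage."""
    p = len(data)                  # assume p is a power of two
    d = p.bit_length() - 1
    have = [[x] for x in data]     # accumulated blocks at each node
    for k in range(d):
        new = []
        for i in range(p):
            partner = i ^ (1 << k)
            # exchange accumulated data with partner and concatenate
            new.append(have[i] + have[partner])
        have = new
    return have
```

Message sizes double each step (1, 2, 4, ...), so the total volume per node is p − 1 blocks despite only log p steps.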


Reduction via Multinode Broadcast

If instead of concatenating messages they are combined using specified associative operation, multinode broadcast can be used to implement reduction

Since all nodes receive final result, this approach avoids root node having to broadcast it after reduction, thereby saving factor of up to two in cost if result is needed by all nodes


Personalized Collective Comm.

In broadcast or multinode broadcast, given node sends same message to all other nodes

In analogous personalized versions, distinct message is sent to each other node

Scatter: analogous to broadcast, but root sends distinct message to each other node
Gather: analogous to reduction, but data received by root are concatenated rather than combined using associative operation
Total exchange: analogous to multinode broadcast, but each node exchanges distinct message with each other node


Personalized Collective Comm.

Scatter uses spanning tree algorithm similar to standard broadcast, but multiple messages are transmitted together at each stage

Root node sends messages to each of its children containing data for entire subtree of which that child is root
Each child retains its own portion of data and forwards appropriate subsets of remainder to each of its children
Eventually every node receives its distinct message
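With node 0 as root and p a power of two, the subtree bookkeeping can be sketched by tracking which half-open range of block indices each node currently holds; in step k, the upper half of a node's range goes to the child across dimension k (an illustrative simulation):

```python
def hypercube_scatter(p):
    """Simulate scatter from root node 0 in a hypercube with p = 2**d
    nodes, one block per node. Returns the block index each node ends
    up holding."""
    d = p.bit_length() - 1
    holding = {0: (0, p)}          # node -> half-open range of block indices
    for k in reversed(range(d)):
        # snapshot: all sends in a step happen simultaneously
        for node, (lo, hi) in list(holding.items()):
            mid = lo + (1 << k)
            if hi > mid:           # send upper half across dimension k
                holding[node] = (lo, mid)
                holding[node + (1 << k)] = (mid, hi)
    return {node: lo for node, (lo, _) in holding.items()}
```

After log p steps every node holds exactly its own block, and the data volume sent by the root halves at each stage.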


Personalized Collective Comm.

Gather uses algorithms similar to reduction, except data are concatenated at each stage rather than combined using associative operation

Total exchange uses algorithm similar to multinode broadcast, except broadcasts are replaced by scatter operations, which are overlapped as in multinode broadcast


Scan or Prefix

In scan or prefix operation, data values x_0, x_1, …, x_{p−1} are given, one per node, along with specified associative operation ⊕

Sequence of partial results s_0, s_1, …, s_{p−1} is to be computed, where

s_k = x_0 ⊕ x_1 ⊕ ⋯ ⊕ x_k

and s_k is to reside on node k, k = 0, …, p − 1


Scan or Prefix

Scan operation can be implemented by algorithms similar to those for multinode broadcast, except that intermediate results received by each node are selectively combined, depending on sending node’s numbering, before forwarding
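One standard realization on a hypercube (an illustrative simulation, assuming p is a power of two): each node carries its scan result so far and the combination over the subcube it has seen; after each pairwise exchange, only the node whose partner has a smaller number folds the received value into its result, which is exactly the selective combining described above:

```python
def hypercube_scan(values, op):
    """Simulate inclusive scan (prefix) on a hypercube in log p
    pairwise-exchange steps, preserving the order of the associative
    (not necessarily commutative) operation op."""
    p = len(values)                # assume p is a power of two
    d = p.bit_length() - 1
    result = list(values)          # s_k accumulating at node k
    msg = list(values)             # combination over subcube seen so far
    for k in range(d):
        incoming = [msg[i ^ (1 << k)] for i in range(p)]
        for i in range(p):
            if (i ^ (1 << k)) < i:
                # partner's subcube precedes this node: fold into result
                result[i] = op(incoming[i], result[i])
                msg[i] = op(incoming[i], msg[i])
            else:
                msg[i] = op(msg[i], incoming[i])
    return result
```

The cost matches multinode broadcast: log p steps, with each node doing one combine per step.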


Circular Shift

In circular k-shift, with 0 < k < p, node i sends data to node (i + k) mod p

Such operations arise in some finite difference and matrix computations and string matching problems

Circular shift can be implemented quite naturally in ring network

Implementing circular shift in other networks can be considerably more complicated, but basically it involves embedding ring or series of rings in given network
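As a sanity check of the index arithmetic, a shared-array simulation: since node i sends to node (i + k) mod p, node i receives the block of node (i − k) mod p:

```python
def circular_shift(data, k):
    """Simulate a circular k-shift across p nodes: node i sends its
    block to node (i + k) mod p, so node i ends up holding the block
    of node (i - k) mod p."""
    p = len(data)
    return [data[(i - k) % p] for i in range(p)]
```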


Barrier

Barrier: synchronization mechanism in which all processors must reach barrier before any processor is allowed to proceed beyond it

Implementation of barrier depends on underlying memory architecture and network

In distributed-memory systems, barrier is usually implemented by message passing, using algorithms similar to those for all-to-all communication

In shared-memory systems, barrier is usually implemented using test-and-set, semaphore, or other mechanism for enforcing mutual exclusion
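One such message-passing scheme is the dissemination barrier, often attributed to Hensgen, Finkel, and Manber: in round j, node i signals node (i + 2^j) mod p, and after ⌈log2 p⌉ rounds every node has, directly or transitively, heard from every other node. A sketch that checks this reachability property:

```python
import math

def dissemination_barrier_rounds(p):
    """Rounds of a dissemination barrier: in round j, every node i
    sends a notification to node (i + 2**j) mod p."""
    return [[(i, (i + (1 << j)) % p) for i in range(p)]
            for j in range(math.ceil(math.log2(p)))]

def all_notified(p):
    """Verify that after all rounds, every node has heard (directly
    or transitively) from every other node."""
    known = [{i} for i in range(p)]
    for rnd in dissemination_barrier_rounds(p):
        snapshot = [set(s) for s in known]   # sends in a round are concurrent
        for sender, receiver in rnd:
            known[receiver] |= snapshot[sender]
    return all(len(k) == p for k in known)
```

Unlike a tree barrier, no separate release broadcast is needed, since all nodes learn of completion symmetrically.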

