2/94


Conditions of Parallelism

The exploitation of parallelism in computing requires

understanding the basic theor associated !ith it"

#rogress is needed in se$eral areas%computation models for parallel computing

interprocessor communication in parallel architectures

integration of parallel sstems into general en$ironments


3/94

Data dependences

The ordering relationship between statements is indicated by the data

dependence.

low dependence

!nti dependence

"utput dependence

#$" dependence

%nknown dependence



4/94

Data Dependence & '

low dependence: S' precedes S2( and at least one output of S' is

input to S2.

!ntidependence: S' precedes S2( and the output of S2 o)erlaps the

input to S'.

"utput dependence: S' and S2 write to the same output )ariable.

#$" dependence: two #$" statements *read$write+ reference the

same )ariable( and$or the same file.


5/94

Data Dependence & 2

%nknown dependence:

o The subscript of a )ariable is itself subscripted.

o The subscript does not contain the loop inde, )ariable.

o ! )ariable appears more than once with subscripts ha)ing different

coefficients of the loop )ariable *that is( different functions of the loop

)ariable+.

o The subscript is nonlinear in the loop inde, )ariable.

Parallel e,ecution of program segments which do not ha)e total data

independence can produce non&deterministic results.


6/94

Data dependence e,ample


&'% (oad )'* +&2% +dd )2* )'

&3% ,o$e )'* )3

&% &tore .* )'

&'

&2 &

&3


7/94

#$" dependence e,ample


&'% )ead /* +/1&2% )e!ind /

&3% rite /* ./1

&% )e!ind /

&' &314


8/94

Control dependence

The order of e,ecution of statements cannot be determined before

run time

o Conditional branches

o Successi)e operations of a looping procedure



9/94

Control dependence e,amples

Do2- # '( N

!*#+ C*#+

#*!*#+ ./T. -+ !*#+'

2- Continue

Do'- # '( N

#*!*#&'+ .01. -+ !*#+-

'- Continue



10/94

esource dependence

Concerned with the conflicts in using shared resources

o #nteger units

o loating&point units

o egisters

o 3emory areas

o !/%

o 4orkplace storage



11/94

5ernstein6s conditions

Set of conditions for two processes to e,ecute in parallel

#'"2 7

#2"' 7

"'"2 7



12/94

5ernstein6s Conditions & 2

#n terms of data dependencies( 5ernstein6s conditions imply that

two processes can e,ecute in parallel if they are flow&independent(

antiindependent( and output&independent.

The parallelism relation 88 is commutati)e *Pi88Pj impliesPj 88Pi +(

but not transiti)e *Pi88PjandPj88Pkdoes not implyPi88Pk+ .

Therefore( 88 is not an e9ui)alence relation.

#ntersection of the input sets is allowed.


13/94

%tiliing 5ernstein6s conditions

P': C D , 0

P2: 3 ; < C

P=: ! 5 < C

P>: C / < 3

P?: ; $ 0


#'

#3

#2 #

#5


14/94

%tiliing 5ernstein6s conditions


15/94

@ardware parallelism

! function of cost and performance tradeoffs

Displays the resource utiliation patterns of

simultaneously e,ecutable operations Denote the number of instruction issues per

machine cycle: k-issueprocessor

! multiprocessor system with n k&issue processorsshould be able to handle a ma,imum number of nk

threads of instructions simultaneously



16/94

Software parallelism

Defined by the control and data dependence of programs

! function of algorithm( programming style( and compiler

organiation

The program flow graph displays the patterns of simultaneously

e,ecutable operations



17/94

3ismatch between software and hardware parallelism & '

(' (2 (3 (

' 2

7 -

+ .

Maximum softwareparallelism (L=load,

X/+/- = arithmetic).

Ccle '

Ccle 2

Ccle 3


18/94

3ismatch between software and hardwareparallelism & 2

('

(2

(

(3'

2

7

-+

.

Ccle '

Ccle 2

Ccle 3

Ccle

Ccle 5

Ccle 6

Ccle 8

Same problem, but

cosideri! the

parallelism o a two-issuesuperscalar processor.


19/94

3ismatch between software and hardwareparallelism & =

('

(2

&'

'

7

(5

(3

(

&2

2

-

(6

.+

Ccle '

Ccle 2

Ccle 3

Ccle

Ccle 5

Ccle 6

Same problem,with two si!le-

issue processors

= iserted for

s"chroi#atio


20/94

Software parallelism

Control parallelism A allows two or more operations to be

performed concurrently

o Pipelining( multiple functional units

Data parallelism A almost the same operation is performed o)er

many data elements by many processors concurrently

o Code is easier to write and debug



21/94

Types of Software Parallelism

Control Parallelism A two or more operations can be performedsimultaneously. This can be detected by a compiler( or aprogrammer can e,plicitly indicate control parallelism by usingspecial language constructs or di)iding a program into multiple

processes. Data parallelism A multiple data elements ha)e the same operations

applied to them at the same time. This offers the highest potentialfor concurrency *in S#3D and 3#3D modes+. Synchroniation inS#3D machines handled by hardware.


22/94

Sol)ing the 3ismatch Problems

De)elop compilation support

edesign hardware for more efficient e,ploitation by compilers

%se large register files and sustained instruction pipelining.

@a)e the compiler fill the branch and load delay slots in code

generated for #SC processors.


23/94

The ole of Compilers

Compilers used to e,ploit hardware features to impro)eperformance.

#nteraction between compiler and architecture design is a necessityin modern computer de)elopment.

#t is not necessarily the case that more software parallelism willimpro)e performance in con)entional scalar processors.

The hardware and compiler should be designed at the same time.


24/94

Program Partitioning B Scheduling

The sie of the parts or pieces of a program that can be considered

for parallel e,ecution can )ary.

The sies are roughly classified using the term granule sie( or

simply granularity.

The simplest measure( for e,ample( is the number of instructions in

a program part.

;rain sies are usually described asfine( mediumor coarse(depending on the le)el of parallelism in)ol)ed.


25/94

/atency

Latencyis the time re9uired for communication between different

subsystems in a computer.

Memory latency( for e,ample( is the time re9uired by a processor to

access memory.

Synchronization latencyis the time re9uired for two processes to

synchronie their e,ecution.

Computational granularity and communicatoin latency are closelyrelated.


26/94

/e)els of Parallelism

9obs or programs

1nstructions

or statements

Non-recursi$e loopsor unfolded iterations

#rocedures* subroutines*

tas:s* or coroutines

&ubprograms* ;ob steps or

related parts of a program

cycles *se9uential program on one processor+

o M>' cycles *L processors+ & speedup '.'

o >> cycles *> processors+ & speedup '.O>


41/94

Static multiprocessor scheduling

;rain packing may not be optimal

Dynamic multiprocessor scheduling is an NP&hard problem

Node duplication is a static scheme for multiprocessor scheduling



42/94

Node duplication

Duplicate some nodes to eliminate idle time and reduce

communication delays

;rain packing and node duplication are often used Eointly to

determine the best grain sie and corresponding schedule



43/94

Schedule without node duplication


+*a*'

b*' c*?

a*?

c*'

C*'.*'

@*2 E*2

e*d*

#' #2 #2#'

+

@

.

1

E

C

1

6

'3

2'

28

23

20

'6

'

'2


44/94

Schedule with node duplication


+*a*'

b*' c*'

a*'

c*'

CA*'.*'

@*2 E*2

#' #2 #2#'

+

@

.

E

C

+

6

'0

'

'3

B

6

C*'

+A*

a*'

C8


45/94

;rain determination and scheduling optimiation

Step ': Construct a fine&grain program graph

Step 2: Schedule the fine&grain computation

Step =: ;rain packing to produce coarse grains

Step >: ;enerate a parallel schedule based on

the packed graph



46/94

Program low 3echanisms

Con)entional machines used control flow mechanism in which order of

program e,ecution e,plicitly stated in user programs.

Dataflow machineswhich instructions can be e,ecuted by determining

operand a)ailability.

Reduction machinestrigger an instruction6s e,ecution based on the

demand for its results.


47/94

Control low )s. Data low

Control flow machines used shared memory for instructions and

data. Since )ariables are updated by many instructions( there may

be side effects on other instructions. These side effects fre9uently

pre)ent parallel processing. Single processor systems are

inherently se9uential.

#nstructions in dataflow machines are unordered and can be

e,ecuted as soon as their operands are a)ailableF data is held in

the instructions themsel)es. Data tokensare passed from an

instruction to its dependents to trigger e,ecution.


48/94

Data low eatures

No need for

o shared memory

o program counter

o

control se9uencer

Special mechanisms are re9uired to

o detect data a)ailability

o match data tokens with instructions needing them

o enable chain reaction of asynchronous instruction e,ecution


49/94

! Dataflow !rchitecture & '

The !r)ind machine *3#T+ has N P0s and an N&by&Ninterconnection network.

0ach P0 has a token&matching mechanism that dispatchesonly instructions with data tokens a)ailable.

0ach datum is tagged witho address of instruction to which it belongso conte,t in which the instruction is being e,ecuted

Tagged tokens enter P0 through local path *pipelined+( andcan also be communicated to other P0s through the routing

network.


50/94

! Dataflow !rchitecture & 2

#nstruction address*es+ effecti)ely replace the program counter in a

control flow machine.

Conte,t identifier effecti)ely replaces the frame base register in a

control flow machine.

Since the dataflow machine matches the data tags from one

instruction with successors( synchronied instruction e,ecution is

implicit.


51/94

! Dataflow !rchitecture & =

!nI-structurein each P0 is pro)ided to eliminate e,cessi)e copying

of data structures.

0ach word of the #&structure has a two&bit tag indicating whether the

)alue is empty( full( or has pending read re9uests.

This is a retreat from the pure dataflow approach.

0,ample 2. shows a control flow and dataflow comparison.

Special compiler technology needed for dataflow machines.


52/94



53/94

Demand&Dri)en 3echanisms

Data&dri)en machines select instructions for e,ecution based on the

a)ailability of their operandsF this is essentially a bottom&up approach.

Demand&dri)en machines take a top&down approach( attempting to

e,ecute the instruction *a demander+ that yields the final result. This

triggers the e,ecution of instructions that yield its operands( and so

forth.

The demand&dri)en approach matches naturally with functional

programming languages *e.g. /#SP and SC@030+.


54/94

eduction 3achine 3odels

String&reduction model:o each demander gets a separate copy of the e,pression string to

e)aluateo each reduction step has an operator and embedded reference to

demand the corresponding operandso each operator is suspended while arguments are e)aluated

;raph&reduction model:o e,pression graph reduced by e)aluation of branches or subgraphs(

possibly in parallel( with demanders gi)en pointers to results ofreductions.

o

based on sharing of pointers to argumentsF tra)ersal and re)ersal ofpointers continues until constant arguments are encountered.


55/94

System #nterconnect !rchitectures

Direct networks for static connections

#ndirect networks for dynamic connections

Networks are used for

o internal connections in a centralied system among

processors

memory modules

#$" disk arrays

o distributed networking of multicomputer nodes


56/94

;oals and !nalysis

The goals of an interconnection network are to pro)ideo low&latencyo high data transfer rateo wide communication bandwidth

!nalysis includeso

latencyo bisection bandwidtho data&routing functionso scalability of parallel architecture


57/94

Network Properties and outing

Static networks: point&to&point direct connections that will not

change during program e,ecution

Dynamic networks:

o switched channels dynamically configured to match user program

communication demands

o include buses( crossbar switches( and multistage networks

5oth network types also used for inter&P0 data routing in S#3D

computers


58/94

Network Parameters

Network sie: The number of nodes in the graph used to represent

the network

Node Degree d: The number of edges incident to a node. Sum of in

degree and out degree

Network Diameter D: The ma,imum shortest path between any two

nodes



59/94

Network Parameters *cont.+

5isection 4idth:o Channel bisection width b: The minimum number of edges along the cut

that di)ides the network in two e9ual hal)eso 0ach channel has w bit wireso 4ire bisection width: 5bwF 5 is the wiring density of the network. #t

pro)ides a good indicator of tha ma, communication bandwidth along thebisection of the network



60/94

Terminology & '

Network usually represented by a graph with a finitenumber of nodes linked by directed or undirected edges.

Number of nodes in graph network size. Number of edges *links or channels+ incident on a node

node degree d *also note in and out degrees whenedges are directed+. Node degree reflects number of #$"ports associated with a node( and should ideally be smalland constant.

DiameterDof a network is the ma,imum shortest pathbetween any two nodes( measured by the number of linkstra)ersedF this should be as small as possible *from acommunication point of )iew+.

T i l


61/94

Terminology & 2

Channel isection width minimum number of edges cutto split a network into two parts each ha)ing the samenumber of nodes. Since each channel has wbit wires( thewire isection width! w. 5isection width pro)idesgood indication of ma,imum communication bandwidthalong the bisection of a network( and all other cross sections

should be bounded by the bisection width. "ire*or channel+ length length *e.g. weight+ of edges

between nodes. Network is symmetric if the topology is the same looking

from any nodeF these are easier to implement or to

program. "ther useful characteriing properties: homogeneousnodesI buffered channelsI nodes are switchesI


62/94

Data outing unctions

Shifting otating Permutation *one to one+ 5roadcast *one to all+ 3ulticast *many to many+ Personalied broadcast *one to many+ Shuffle 0,change 0tc.


63/94

Permutations

or n obEects there are nQ permutations by whichthe n obEects can be reordered.The set of allpermutations form a permutation group with

respect to a composition operation. Cyclenotation can be used to specify a permutationoperation.

Permutation *a( b( c+*d( e+ means: a&Rb( b&Rc(c&Ra( d&Re and e&Rd in a circular fashion. The

cycle *a( b( c+ has a period of =( and the cycle *d(e+ has a period of 2. will ha)e a period e9ual to2 , = .



64/94

Permutations *cont.+

Can be implemented using crossbar switches( multistage networks

or with shifting or broadcast operations.

Permutation capability is an indication of network6s data routing

capabilities



65/94

Perfect Shuffle and 0,change

Stone suggested the special permutation that entries according to

the mapping of the k&bit binary number a # kto c # k a*that is(

shifting ' bit to the left and wrapping it around to the least

significant bit position+.

The in)erse perfect shuffle re)erses the effect of the perfect shuffle.


66/94

Perfect Shuffle

Special permutation function

n 2kobEectsF each obEect representation re9uires k bits

Perfect shuffle maps , to y where:

o , * ,k&'( ( ,'( ,- +

o y * ,k&2( ( ,'( ,-( ,k&'+



67/94

0,change

n 2kobEectsF each obEect representation re9uires k bits

The e,change maps , to y where:

o , * ,k&'( ( ,'( ,- +

o y * ,k&'( ( ,'( ,-6 +

@ypercube routing functions are e,changes



68/94

@ypercube outing unctions

#f the )ertices of a n&dimensional cube are labeled with n&bit

numbers so that only one bit differs between each pair of adEacent

)ertices( then nrouting functions are defined by the bits in the node

*)erte,+ address.

or e,ample( with a =&dimensional cube( we can easily identify

routing functions that e,change data between nodes with addresses

that differ in the least significant( most significant( or middle bit.


69/94

5roadcast and 3ulticast

5roadcast: "ne&to&all mapping

3ulticast: one subset to another subset

Personalied 5roadcast: Personalied messages to only selected

recei)ers



70/94

Network Performance

unctionality

Network latency

5andwidth

@ardware comple,ity

Scalability



71/94

actors !ffecting Performance

unctionality A how the network supports data routing( interrupt

handling( synchroniation( re9uest$message combining( and

coherence

Network latency A worst&case time for a unit message to betransferred

5andwidth A ma,imum data rate

@ardware comple,ity A implementation costs for wire( logic(

switches( connectors( etc. Scalability A how easily does the scheme adapt to an increasing

number of processors( memories( etc.I


72/94

Static Networks

/inear !rray

ing and Chordal ing

5arrel Shifter

Tree and Star

at Tree

3esh and Torus


73/94

Static Networks A /inear !rray

N nodes connected by n&' links *not a bus+F segments between

different pairs of nodes can be used in parallel.

#nternal nodes ha)e degree 2F end nodes ha)e degree '.

Diameter n&'

5isection '

or small n( this is economical( but for large n( it is ob)iously

inappropriate.


74/94

Static Networks A ing( Chordal ing

/ike a linear array( but the two end nodes are connected by an n thlinkF the ring can be uni& or bi&directional. Diameter is n$2for abidirectional ring( or n for a unidirectional ring.

5y adding additional links *e.g. chords in a circle+( the node degreeis increased( and we obtain a chordal ring. This reduces the network

diameter. #n the limit( we obtain a fully&connected network( with a node

degree of n &' and a diameter of '.


75/94

Static Networks A 5arrel Shifter

/ike a ring( but with additional links between all pairs of nodes thatha)e a distance e9ual to a power of 2.

4ith a network of sie$ 2n( each node has degree d 2n &'( andthe network has diameterD n $2.

5arrel shifter connecti)ity is greater than any chordal ring of lower

node degree. 5arrel shifter much less comple, than fully&interconnected network.


76/94

Static Networks A Tree and Star

! k&le)el completely balanced binary tree will ha)e$ 2kA ' nodes(

with ma,imum node degree of = and network diameter is 2*k A '+.

The balanced binary tree is scalable( since it has a constant

ma,imum node degree.

! star is a two&le)el tree with a node degree d$A ' and a constant

diameter of 2.


77/94

Static Networks A at Tree

! fat tree is a tree in which the number of edges between nodes

increases closer to the root *similar to the way the thickness of limbs

increases in a real tree as we get closer to the root+.

The edges represent communication channels *wires+( and since

communication traffic increases as the root is approached( it seems

logical to increase the number of channels there.


78/94

Static Networks A 3esh and Torus

Pure meshA$ n k nodes with links between each adEacent pair of

nodes in a row or column *or higher degree+. This is not a

symmetric networkF interior node degree d 2k( diameter k *nA

'+. Illiac mesh*used in #lliac #H computer+ A wraparound is allowed(

thus reducing the network diameter to about half that of the

e9ui)alent pure mesh.

! torushas ring connections in each dimension( and is symmetric.!n n n binary torus has node degree of > and a diameter of 2 n $

2.


79/94

Static Networks A Systolic !rray

! systolic array is an arrangement of processing elements and

communication links designed specifically to match the

computation and communication re9uirements of a specific

algorithm *or class of algorithms+.

This specialied character may yield better performance than more

generalied structures( but also makes them more e,pensi)e( and

more difficult to program.


80/94

Static Networks A @ypercubes

! binary n&cube architecture with$ 2nnodes spanning along n

dimensions( with two nodes per dimension.

The hypercube scalability is poor( and packaging is difficult for

higher&dimensional hypercubes.


81/94

Static Networks A Cube&connected Cycles

k&cube connected cycles *CCC+ can be created from a k&cube by

replacing each )erte, of the k&dimensional hypercube by a ring of k

nodes.

! k&cube can be transformed to a k&CCC with k2knodes.

The maEor ad)antage of a CCC is that each node has a constant

degree *but longer latency+ than in the corresponding k&cube. #n

that respect( it is more scalable than the hypercube architecture.


82/94

Static Networks A k&ary n&Cubes

ings( meshes( tori( binary n&cubes( and "mega networks *to be

seen+ are topologically isomorphic to a family of k&ary n&cube

networks.

nis the dimension of the cube( and kis the radi,( or number of of

nodes in each dimension.

The number of nodes in the network($( is k n.

olding *alternating nodes between connections+ can be used to

a)oid the long end&around delays in the traditional

implementation.


83/94

Static Networks A k&ary n&Cubes

The cost of k&ary n&cubes is dominated by the amount of wire( not

the number of switches.

4ith constant wire bisection( low&dimensional networks with wider

channels pro)ide lower latecny( less contention( and higher hot&

spot throughput than higher&dimensional networks with narrower

channels.


84/94

Network Throughput

$etwork through%utA number of messages a network can handle in

a unit time inter)al.

"ne way to estimate is to calculate the ma,imum number of

messages that can be present in a network at any instant *itsca%acity+F throughput usually is some fraction of its capacity.

! hot s%otis a pair of nodes that accounts for a disproportionately

large portion of the total network traffic *possibly causing

congestion+.

&ot s%ot through%utis ma,imum rate at which messages can be

sent between two specific nodes.


85/94

3inimiing /atency

/atency is minimied when the network radi, kand dimension nare

chose so as to make the components of latency due to distance * of

hops+ and the message aspect ratioL $ "*message length / di)ided

by the channel width " + appro,imately e9ual.

This occurs at a )ery low dimension. or up to '-2> nodes( the best

dimension *in this respect+ is 2.


86/94

Dynamic Connection Networks

Dynamic connection networks can implement all communication

patterns based on program demands.

#n increasing order of cost and performance( these include

o

bus systemso multistage interconnection networks

o crossbar switch networks

Price can be attributed to the cost of wires( switches( arbiters( and

connectors. Performance is indicated by network bandwidth( data transfer rate(

network latency( and communication patterns supported.


87/94

Dynamic Networks A 5us Systems

! ussystem *contention us( time-sharing us+ has

o a collection of wires and connectors

o multiple modules *processors( memories( peripherals( etc.+ which

connect to the wireso data transactions between pairs of modules

5us supports only one transaction at a time.

5us arbitration logic must deal with conflicting re9uests.

/owest cost and bandwidth of all dynamic schemes.

3any bus standards are a)ailable.


88/94

Dynamic Networks A Switch 3odules

!n aswitch module has a inputs and b outputs. ! binary switch

has a 2.

#t is not necessary for a ( but usually a 2k( for some integer

k.

#n general( any input can be connected to one or more of the

outputs. @owe)er( multiple inputs may not be connected to the

same output.

4hen only one&to&one mappings are allowed( the switch is called a

crossar switch.


89/94

3ultistage Networks

#n general( any multistage network is comprised of a collection of a

switch modules and fi,ed network modules. The aswitch

modules are used to pro)ide )ariable permutation or other

reordering of the inputs( which are then further reordered by thefi,ed network modules.

! generic multistage network consists of a se9uence alternating

dynamic switches *with relati)ely small )alues for aand 'with

static networks *with larger numbers of inputs and outputs+. The

static networks are used to implement interstage connections *#SC+.


90/94

"mega Network

! 2 2 switch can be configured foro Straight&througho Crosso)ero %pper broadcast *upper input to both outputs+o /ower broadcast *lower input to both outputs+o *No output is a somewhat )acuous possibility as well+

4ith four stages of eight 2 2 switches( and a staticperfect shuffle for each of the four #SCs( a ' by '"mega network can be constructed *but not allpermutations are possible+.

#n general ( an n&input "mega network re9uires log 2nstages of 2 2 switches and n $ 2 switch modules.


91/94



92/94

5aseline Network

! baseline network can be shown to be topologically e9ui)alent to

other networks *including "mega+( and has a simple recursi)e

generation procedure.

Stage k *k -( '( + is an m m switch block *where m N $ 2k+

composed entirely of 2 2 switch blocks( each ha)ing two

configurations: straight through and crosso)er.


93/94

> > 5aseline Network


94/94

Crossbar Networks

! m n crossbar network can be used to pro)ide a constant latency

connection between de)icesF it can be thought of as a single stage switch.

Different types of de)ices can be connected( yielding different

constraints on which switches can be enabled.o 4ith m processors and n memories( one processor may be able to generate re9uests

for multiple memories in se9uenceF thus se)eral switches might be set in the same

row.

o or m m interprocessor communication( each P0 is connected to both an input

and an output of the crossbarF only one switch in each row and column can be

turned on simultaneously. !dditional control processors are used to manage the

crossbar itself.

Program and Network Properties

Documents

Transcript of Program and Network Properties