
Program & network properties

Module 4

Points to be covered

• Conditions of parallelism (2.1, ACA, Kai Hwang)
o Data dependence and resource dependence
o Hardware and software parallelism
o The role of the compiler

• Program partitioning and scheduling (2.2, ACA, Kai Hwang)
o Grain size and latency
o Grain packing and scheduling

• System interconnect architecture (2.4, ACA, Kai Hwang)
o Network properties and routing
o Static connection networks
o Dynamic connection networks

Conditions of parallelism

The ability to execute several program segments in parallel requires each segment to be independent of the other segments. We use a dependence graph to describe the relations.

The nodes of a dependence graph correspond to the program statements (instructions), and directed edges with different labels represent the ordered relations among the statements.

Data dependence

The ordering relationship between statements is indicated by data dependence. Five types of data dependence are defined below:
1) Flow dependence
2) Anti-dependence
3) Output dependence
4) I/O dependence
5) Unknown dependence

Flow dependence

A statement S2 is flow-dependent on S1 if an execution path exists from S1 to S2 and if at least one output (a variable assigned) of S1 feeds in as an input (an operand to be used) to S2. This is denoted S1 → S2.

Example:
S1: Load R1, A
S2: Add R2, R1

Anti-dependence

Statement S2 is anti-dependent on statement S1 if S2 follows S1 in program order and if the output of S2 overlaps the input to S1.

Example:
S1: Add R2, R1
S2: Move R1, R3

Output dependence

Two statements are output-dependent if they produce (write) the same output variable.

Example:
S1: Load R1, A
S2: Move R1, R3

I/O dependence

Read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.

Example:
S1: Read(4), A(I)
S3: Write(4), A(I)

Unknown dependence

The dependence relation between two statements cannot be determined, for example when indirect addressing is used.

Combined example:
S1: Load R1, A
S2: Add R2, R1
S3: Move R1, R3
S4: Store B, R1

Flow dependence: S1 to S2, S3 to S4
Anti-dependence: S2 to S3
Output dependence: S1 to S3
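To make these relations concrete, here is a rough C rendering of the four statements above (my illustration, not from the slides; the registers are written as ordinary variables):

/* Hedged C sketch of the combined example; variable names mirror the registers. */
int A = 5, B, R1, R2 = 1, R3 = 7;

void example(void) {
    R1 = A;        /* S1: Load R1, A  */
    R2 = R2 + R1;  /* S2: Add R2, R1  -- flow-dependent on S1 (reads R1 written by S1)  */
    R1 = R3;       /* S3: Move R1, R3 -- anti-dependent on S2 (overwrites R1 that S2 reads);
                                         output-dependent on S1 (both write R1)          */
    B  = R1;       /* S4: Store B, R1 -- flow-dependent on S3 (reads R1 written by S3)   */
}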

Control dependence

This refers to the situation where the order of execution of statements cannot be determined before run time.

For example, in conditional statements the flow of control depends on the outcome of the condition.

Different paths taken after a conditional branch may depend on the data; hence we need to eliminate this data dependence among the instructions.

This dependence also exists between operations performed in successive iterations of a looping procedure.

Control dependence often prohibits parallelism from being exploited.

Control-independent example:
for (i = 0; i < n; i++) {
    a[i] = c[i];
    if (a[i] < 0) a[i] = 1;
}

Control-dependent example:
for (i = 1; i < n; i++) {
    if (a[i-1] < 0) a[i] = 1;
}

Compilers are used to eliminate this control dependence and exploit the available parallelism.

Resource dependence

Resource dependence is concerned with conflicts in using shared resources, such as registers, integer and floating-point ALUs, etc. ALU conflicts are called ALU dependence.

Memory (storage) conflicts are called storage dependence.

Bernstein’s conditions

Bernstein’s conditions are a set of conditions which must hold if two processes are to execute in parallel.

Notation: Ii is the set of all input variables for a process Pi; Ii is also called the read set or domain of Pi. Oi is the set of all output variables for a process Pi; Oi is also called the write set.

If P1 and P2 can execute in parallel (which is written as P1 || P2), then we must have:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅

In terms of data dependences, Bernstein’s conditions imply that two processes can execute in parallel if they are flow-independent, anti-independent, and output-independent.

The parallelism relation || is commutative (Pi || Pj implies Pj || Pi), but not transitive (Pi || Pj and Pj || Pk does not imply Pi || Pk).
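As a small illustration (not from the slides), the three conditions can be checked mechanically once the read and write sets of two processes are known. The sketch below encodes each set as a bit mask over a small, hypothetical pool of variables:

#include <stdbool.h>
#include <stdio.h>

/* Hedged sketch: read/write sets encoded as bit masks over at most 32 variables. */
typedef struct { unsigned read; unsigned write; } Process;

/* P1 || P2 holds iff I1 ∩ O2, I2 ∩ O1 and O1 ∩ O2 are all empty (Bernstein). */
bool can_run_in_parallel(Process p1, Process p2) {
    return (p1.read  & p2.write) == 0 &&   /* I1 ∩ O2 = empty */
           (p2.read  & p1.write) == 0 &&   /* I2 ∩ O1 = empty */
           (p1.write & p2.write) == 0;     /* O1 ∩ O2 = empty */
}

int main(void) {
    /* Hypothetical variables A..D mapped to bits 0..3. */
    Process p1 = { .read = 0x1, .write = 0x2 };  /* reads A, writes B */
    Process p2 = { .read = 0x4, .write = 0x8 };  /* reads C, writes D */
    printf("P1 || P2 ? %s\n", can_run_in_parallel(p1, p2) ? "yes" : "no");
    return 0;
}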

Hardware and software parallelism

Hardware parallelism is defined by the machine architecture. It can be characterized by the number of instructions that can be issued per machine cycle. If a processor issues k instructions per machine cycle, it is called a k-issue processor.

Conventional processors are one-issue machines.

Examples: the Intel i960CA is a three-issue processor (arithmetic, memory access, branch); the IBM RS/6000 is a four-issue processor (arithmetic, floating-point, memory access, branch).

Software parallelism

Software parallelism is defined by the control and data dependences of a program and is revealed in the program’s flow graph, i.e., it is determined by dependences within the code and is a function of the algorithm, programming style, and compiler optimization.

Mismatch between software and hardware parallelism - 1

[Figure: maximum software parallelism (L = load, X/+/- = arithmetic). Four loads (L1-L4), two arithmetic operations (X1, X2), an add and a subtract producing results A and B are scheduled in three cycles (Cycle 1 to Cycle 3).]

Mismatch between software and hardware parallelism - 2

[Figure: the same problem, but considering the parallelism available on a two-issue superscalar processor; the schedule now stretches over seven cycles (Cycle 1 to Cycle 7).]

Types of Software Parallelism

Control parallelism – two or more operations can be performed simultaneously. This can be detected by a compiler, or a programmer can explicitly indicate control parallelism by using special language constructs or by dividing a program into multiple processes.

Data parallelism – multiple data elements have the same operation applied to them at the same time. This offers the highest potential for concurrency (in SIMD and MIMD modes). Synchronization in SIMD machines is handled by the hardware.

The Role of Compilers

Compilers are used to exploit hardware features to improve performance. Interaction between compiler and architecture design is a necessity in modern computer development.

It is not necessarily the case that more software parallelism will improve performance in conventional scalar processors.

The hardware and compiler should be designed at the same time.

Program Partitioning & Scheduling

The size of the parts or pieces of a program that can be considered for parallel execution can vary.

The sizes are roughly classified using the term “granule size,” or simply “granularity.”

The simplest measure, for example, is the number of instructions in a program part.

Grain sizes are usually described as fine, medium or coarse, depending on the level of parallelism involved.

Latency

Latency is the time required for communication between different subsystems in a computer.

Memory latency, for example, is the time required by a processor to access memory.

Synchronization latency is the time required for two processes to synchronize their execution.

Levels of Parallelism

Coarse grain: jobs or programs; subprograms, job steps, or related parts of a program
Medium grain: procedures, subroutines, tasks, or coroutines
Fine grain: non-recursive loops or unfolded iterations; instructions or statements

As the grain becomes finer, both the potential degree of parallelism and the communication demand and scheduling overhead increase.

Types and Levels of Parallelism
1) Instruction-level parallelism
2) Loop-level parallelism
3) Procedure-level parallelism
4) Subprogram-level parallelism
5) Job or program-level parallelism

Instruction-level parallelism

This fine-grained (smallest-granularity) level typically involves fewer than 20 instructions per grain.

The number of candidates for parallel execution varies from two to thousands, with about five instructions or statements being the average degree of parallelism.

Advantages:
There are usually many candidates for parallel execution.
Compilers can usually do a reasonable job of finding this parallelism.

Loop-level parallelism

A typical loop has fewer than 500 instructions. If the loop iterations are independent of one another, the loop can be handled by a pipeline or by a SIMD machine.

Loops are the most optimized program construct to execute on a parallel or vector machine. Some loops (e.g. recursive ones) are difficult to handle.

Loop-level parallelism is still considered fine-grain computation.

Procedure-level parallelism

Medium-sized grain; usually fewer than 2000 instructions.

Detection of parallelism is more difficult than with smaller grains; interprocedural dependence analysis is difficult.

The communication requirement is less than at the instruction level. SPMD (single program, multiple data) execution is a special case.

Multitasking belongs to this level.

Subprogram-level parallelism

The grain typically has thousands of instructions. Multiprogramming is conducted at this level.

No compilers are currently available to exploit medium- or coarse-grain parallelism.

Job or program level

This corresponds to the execution of essentially independent jobs or programs on a parallel computer.

It is practical for a machine with a small number of powerful processors, but impractical for a machine with a large number of simple processors (since each processor would take too long to process a single job).

Permutations

Given n objects, there are n! ways in which they can be reordered (one of which is no reordering).

A permutation can be specified by giving the rule for reordering a group of objects.

Permutations can be implemented using crossbar switches, multistage networks, shifting, and broadcast operations.

Example (permutation in a crossbar switch):

[Figure: a 4 x 4 crossbar, with inputs 1-4 on one side and outputs 1-4 on the other, realizes the permutation (1, 2, 3, 4) -> (3, 1, 2, 4) by closing one crosspoint in each row and column.]
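As a tiny illustration (my sketch, not part of the slides), the crossbar setting above can be modeled as a lookup table mapping each input port to its output port:

#include <stdio.h>

int main(void) {
    /* perm[i] gives the output port that input port i is connected to
       (1-based, index 0 unused); this encodes (1, 2, 3, 4) -> (3, 1, 2, 4). */
    int perm[5] = { 0, 3, 1, 2, 4 };
    for (int in = 1; in <= 4; in++)
        printf("input %d -> output %d\n", in, perm[in]);
    return 0;
}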

Perfect Shuffle and Exchange

Stone suggested the special perfect-shuffle permutation, which maps an entry whose k-bit binary address is a b c … k to the address b c … k a (that is, the address is cyclically shifted one bit to the left, with the most significant bit wrapping around to the least significant bit position).

The inverse perfect shuffle reverses the effect of the perfect shuffle.
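A small sketch (not from the slides) of the shuffle and its inverse on k-bit addresses; k and the sample value are chosen here only for illustration:

#include <stdio.h>

/* Perfect shuffle of a k-bit address: cyclic left shift by one bit. */
unsigned shuffle(unsigned addr, unsigned k) {
    unsigned msb = (addr >> (k - 1)) & 1u;           /* bit that wraps around */
    return ((addr << 1) | msb) & ((1u << k) - 1u);   /* keep only k bits      */
}

/* Inverse perfect shuffle: cyclic right shift by one bit. */
unsigned inverse_shuffle(unsigned addr, unsigned k) {
    unsigned lsb = addr & 1u;
    return (addr >> 1) | (lsb << (k - 1));
}

int main(void) {
    unsigned k = 3;
    unsigned x = 5;                        /* binary 101 */
    printf("%u\n", shuffle(x, k));         /* prints 3, i.e. binary 011 */
    printf("%u\n", inverse_shuffle(3, k)); /* prints 5, i.e. binary 101 */
    return 0;
}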

Hypercube Routing Functions

If the vertices of an n-dimensional cube are labeled with n-bit numbers so that only one bit differs between each pair of adjacent vertices, then n routing functions are defined by the bits in the node (vertex) address.

For example, with a 3-dimensional cube, we can easily identify routing functions that exchange data between nodes with addresses that differ in the least significant, most significant, or middle bit.
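A minimal sketch (my illustration) of these routing functions: the i-th routing function simply complements bit i of the node address, sending data to the neighbor across dimension i.

#include <stdio.h>

/* Cube routing function: complement bit i of a node address, giving the
   neighbor across dimension i (sketch; assumes i < 32). */
unsigned cube_route(unsigned node, unsigned i) {
    return node ^ (1u << i);
}

int main(void) {
    unsigned node = 5;  /* binary 101 in a 3-dimensional cube */
    printf("dim 0 neighbour: %u\n", cube_route(node, 0));  /* 100 = 4 */
    printf("dim 1 neighbour: %u\n", cube_route(node, 1));  /* 111 = 7 */
    printf("dim 2 neighbour: %u\n", cube_route(node, 2));  /* 001 = 1 */
    return 0;
}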

Factors Affecting Performance

Functionality – how the network supports data routing, interrupt handling, synchronization, request/message combining, and coherence.
Network latency – worst-case time for a unit message to be transferred.
Bandwidth – maximum data transfer rate.
Hardware complexity – implementation costs for wires, logic, switches, connectors, etc.
Scalability – how easily the scheme adapts to an increasing number of processors, memories, etc.

Dynamic Networks – Bus Systems

A bus system (contention bus, time-sharing bus) is a collection of wires and connectors; multiple modules (processors, memories, peripherals, etc.) connect to the wires, and data transactions take place between pairs of modules.

A bus supports only one transaction at a time, so bus arbitration logic must deal with conflicting requests.

Buses have the lowest cost and bandwidth of all dynamic connection schemes. Many bus standards are available.


2 × 2 Switches

[Figure: 2 × 2 switch configurations; from Advanced Computer Architecture, K. Hwang, 1993.]

Single-stage networks

[Figure: a single-stage Shuffle-Exchange interconnection network (left) and the perfect-shuffle mapping function (right).]

Perfect shuffle operation: cyclic shift one place to the left, e.g. 101 --> 011.

Exchange operation: invert the least significant bit, e.g. 101 --> 100.

*From Ben Macey at http://www.ee.uwa.edu.au/~maceyb/aca319-2003

Crossbar Switch Connections

A crossbar has n inputs and m outputs; n and m are usually the same. Data can flow in either direction. Each crosspoint can open or close to realize a connection.

[Figure: an n x m crossbar with inputs 0 to n-1 and outputs 0 to m-1.]

All possible combinations can be realized. The inputs are usually connected to processors, and the outputs are connected to memory, I/O, or other processors. These switches have a complexity of O(n²): doubling the number of inputs and outputs quadruples the number of crosspoints. To solve this problem, multistage interconnection networks were developed.
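To make the O(n²) growth concrete (a worked illustration, not from the slides), an n x n crossbar needs n² crosspoints, so going from 16 x 16 to 32 x 32 raises the count from 256 to 1024:

#include <stdio.h>

int main(void) {
    /* Crosspoint count of an n x n crossbar grows as n squared. */
    for (unsigned n = 8; n <= 64; n *= 2)
        printf("n = %2u  crosspoints = %u\n", n, n * n);
    return 0;
}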


Multistage Interconnection Networks

The capability of a single-stage network is limited, but if we cascade enough stages together, they form a completely connected MIN (multistage interconnection network).

Switches can perform their own routing or can be controlled by a central router.

These networks can be classified into the following three categories:

Nonblocking
A network is called strictly nonblocking if it can connect any idle input to any idle output regardless of what other connections are currently in progress.

Rearrangeable nonblocking
A network is rearrangeable nonblocking if it can establish all possible connections between inputs and outputs by rearranging its existing connections.

Blocking
A network is said to be blocking if it can perform many, but not all, possible connections between terminals. Example: the Omega network.

Multistage networks

Refer to Fig. 2.23 on page 91 of the ACA book (Kai Hwang).

Different classes of multistage interconnection networks (MINs) differ in the switch modules used and in the kind of interstage connection (ISC) pattern used.

The patterns often include the perfect shuffle, butterfly, crossbar, cube connection, etc.

Omega Network

A 2 x 2 switch can be configured for:
Straight-through
Crossover
Upper broadcast (upper input to both outputs)
Lower broadcast (lower input to both outputs)
(No output is a somewhat vacuous possibility as well.)

With four stages of eight 2 x 2 switches, and a static perfect-shuffle interconnection for each of the four ISCs, a 16 x 16 Omega network can be constructed (but not all permutations are possible).

In general, an n-input Omega network requires log2(n) stages of 2 x 2 switches, with n/2 switch modules per stage.
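Omega networks are usually self-routed by the destination address (destination-tag routing, which these slides do not spell out): at each stage the switch examines one bit of the destination, most significant bit first, taking the upper output for 0 and the lower output for 1. A hedged C sketch for an 8 x 8 network:

#include <stdio.h>

/* Hedged sketch of destination-tag self-routing in an n x n Omega network
   (n a power of two, stages = log2(n)). */
void omega_route(unsigned dest, unsigned stages) {
    for (unsigned s = 0; s < stages; s++) {
        unsigned bit = (dest >> (stages - 1 - s)) & 1u;
        printf("stage %u: take %s output\n", s, bit ? "lower" : "upper");
    }
}

int main(void) {
    omega_route(5, 3);   /* route to destination 101 in an 8 x 8 network */
    return 0;
}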

[Figure: a 16 x 16 Omega network.]

8 x 8 Omega Network

[Figure: an 8 x 8 Omega network with inputs 0-7, outputs 0-7, and twelve 2 x 2 switches labeled A through L arranged in three stages.]

Has complexity of O(n lg n).

The fixed links between every pair of stages are identical: they form a perfect shuffle. For 8 possible inputs there are a total of 8! = 40,320 one-to-one mappings of the inputs onto the outputs, but there are only 12 switches, giving just 2^12 = 4096 settings. Thus the network is blocking.

The network consists of four 2 x 2 switches per stage.


Baseline Network

[Figure: an 8 x 8 Baseline network with inputs 0-7, outputs 0-7, and twelve 2 x 2 switches labeled A through L.]

Similar to the Omega network, essentially the front half of a Benes network.

To generalize to an n x n Baseline network, first create one stage of n/2 2 x 2 switches. One output from each 2 x 2 switch is connected to an input of each of two (n/2) x (n/2) sub-blocks. Each (n/2) x (n/2) sub-block is then replaced by an (n/2) x (n/2) Baseline network constructed in the same way.


The Baseline and Omega networks are isomorphic with each other.

Isomorphism Between Baseline and Omega Networks

Starting with the Baseline network, if switches B and C, and F and G, are repositioned while keeping the fixed links attached as the switches are moved, the Baseline network transforms into the Omega network. Therefore, the Baseline and Omega networks are isomorphic.

[Figures: the 8 x 8 Baseline network before and after repositioning switches B, C, F, and G.]

Recursive Construction

The first stage contains one N x N block, and the second stage contains two (N/2) x (N/2) sub-blocks, labeled C0 and C1.

The construction can be repeated recursively on the sub-blocks until 2 x 2 switches are reached.
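A small recursive sketch (my illustration) of how the construction bottoms out: each level of the recursion contributes one stage of N/2 switches, so an N x N network built this way ends up with log2(N) stages and (N/2)·log2(N) switches, matching the 12 switches counted above for the 8 x 8 case.

#include <stdio.h>

/* Hedged sketch: count the 2 x 2 switches produced by the recursive
   construction of an N x N network (N a power of two). Each level
   contributes one stage of N/2 switches, then recurses into two
   (N/2) x (N/2) sub-blocks. */
unsigned count_switches(unsigned n) {
    if (n == 2)
        return 1;                            /* a single 2 x 2 switch */
    return n / 2 + 2 * count_switches(n / 2);
}

int main(void) {
    printf("8 x 8  : %u switches\n", count_switches(8));   /* 12 */
    printf("16 x 16: %u switches\n", count_switches(16));  /* 32 */
    return 0;
}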

Crossbar Networks

A crossbar network can be visualized as a single-stage switch network. Like a telephone switchboard, the crosspoint switches provide dynamic connections between (source, destination) pairs.

Each crosspoint switch can provide a dedicated connection path between a pair.

The switch can be set on or off dynamically upon program demand.

Shared-Memory Crossbar

To build a shared-memory multiprocessor, one can use a crossbar network between the processors and the memory modules (Fig. 2.26a).

The C.mmp multiprocessor implemented a 16 x 16 crossbar network connecting 16 PDP-11 processors to 16 memory modules, each with a capacity of 1 million words.

[Figure: a shared-memory crossbar switch connecting processors to memory modules.]

Shared-Memory Crossbar Switch

Note that each memory module can satisfy only one processor request at a time. When multiple requests arrive at the same memory module simultaneously, the crossbar must resolve the conflicts.
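One simple way to resolve such conflicts is sketched below (fixed-priority arbitration is my choice of policy; the slides do not specify one): each memory module is granted to the lowest-numbered requesting processor in a cycle.

#include <stdio.h>

#define NPROC 4
#define NMEM  4

/* request[p] holds the memory module processor p wants this cycle, or -1.
   On return, grant[m] is the processor granted module m, or -1 if idle.
   Fixed-priority policy: the lowest-numbered requester wins. */
void arbitrate(const int request[NPROC], int grant[NMEM]) {
    for (int m = 0; m < NMEM; m++) grant[m] = -1;
    for (int p = 0; p < NPROC; p++) {
        int m = request[p];
        if (m >= 0 && grant[m] == -1)
            grant[m] = p;          /* module m was free: close that crosspoint */
    }
}

int main(void) {
    int request[NPROC] = { 2, 2, 0, -1 };   /* P0 and P1 both want module 2 */
    int grant[NMEM];
    arbitrate(request, grant);
    for (int m = 0; m < NMEM; m++)
        printf("module %d -> processor %d\n", m, grant[m]);
    return 0;
}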

Interprocessor Communication Crossbar Switch

This large crossbar was actually built in a vector parallel processor.

The PEs are processors with attached memory. The CPs are control processors, which are used to supervise the entire system operation.


Summary of Crossbar Networks

Different types of devices can be connected, yielding different constraints on which switches can be enabled.

• With m processors and n memories, one processor may generate requests for multiple memories in sequence; thus several switches may be set in the same row.

• For m x m interprocessor communication, each PE is connected to both an input and an output of the crossbar; only one switch in each row and each column can be turned on simultaneously. Additional control processors are used to manage the crossbar itself.

End Of Module 4