Parallel Computing 5
Parallel Application Design

Ondřej Jakl
Institute of Geonics, Academy of Sciences of the CR

Outline of the lecture

• Task/channel model

• Foster’s design methodology

• Partitioning

• Communication analysis

• Agglomeration

• Mapping to processors

• Examples


Design of parallel algorithms

• In general a very creative process

• Only methodical frameworks available

• Usually several alternatives to be considered

• The best parallel solution may differ substantially from what the sequential approach suggests


Task/channel model (1)

• Introduced in Ian Foster’s Designing and Building Parallel Programs [Foster 1995]

– http://www-unix.mcs.anl.gov/dbpp

• Represents a parallel computation as a set of tasks
– a task is a program, its local memory and a collection of I/O ports

• task can send local data values to other tasks via output ports

• task can receive data values from other tasks via input ports

• The tasks may interact with each other by sending messages through channels

– channel is a message queue that connects one task’s output port with another task’s input port

– a nonblocking (asynchronous) send and a blocking receive are assumed

• An abstraction close to the message passing model

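To make the abstraction concrete, here is a minimal sketch in C with MPI of two tasks joined by one channel: the model's nonblocking send maps to MPI_Isend, the blocking receive to MPI_Recv. The two-rank setup and the value 42 are illustrative assumptions, not part of the model itself (run with at least two processes, e.g. mpirun -np 2).

    #include <mpi.h>
    #include <stdio.h>

    /* Two tasks connected by one "channel": task 0's output port is the
     * (destination, tag) pair, task 1's input port the (source, tag) pair. */
    int main(int argc, char *argv[])
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {                     /* producer task */
            MPI_Request req;
            value = 42;
            /* nonblocking send, as the task/channel model supposes */
            MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else if (rank == 1) {              /* consumer task */
            /* blocking receive: the task waits until data arrives */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }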


Task/channel model (2)

[Figure after Quinn 2004: each task comprises a program, local memory, input ports and output ports; a channel connects one task's output port to another task's input port]

Directed graph of tasks (vertices) and channels (edges)

Foster’s methodology [Foster 1995]

• Design stages:
  1. partitioning into concurrent tasks
  2. communication analysis to coordinate the tasks
  3. agglomeration into larger tasks with respect to the target platform
  4. mapping of tasks to processors

• Stages 1, 2 are on the conceptual level; 3, 4 are implementation dependent

• In practice often considered simultaneously


Partitioning (decomposition)

• Process of dividing the computation and the data into pieces – primitive tasks

• Goal: expose the opportunities for parallel processing

• Maximal (fine-grained) decomposition for greater flexibility

• Complementary techniques:
  – domain decomposition (data-centric approach)
  – functional decomposition (computation-centric approach)

• Combinations possible
  – usual scenario: primary decomposition functional, secondary decomposition domain


Domain (data) decomposition

• Primary object of decomposition: the processed data
  – first, the data associated with the problem is divided into pieces
    • focus on the largest and/or most frequently accessed data
    • the pieces should be of comparable size
  – next, the computation is partitioned according to the data on which it operates
    • usually the same code for each task (SPMD – Single Program Multiple Data)
    • may be non-trivial and may bring up complex mathematical problems

• The most often used technique in parallel programming


3D grid data: one-, two-, three-dimensional decomposition [Foster 1995]
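For illustration, the arithmetic of a balanced 1-D block decomposition (pieces of comparable size, as required above) can be written down directly. This is the standard balanced-block formula, not taken from the slides; block_lo/block_hi are hypothetical helper names.

    #include <stdio.h>

    /* Balanced 1-D block decomposition of n items among p tasks:
     * task id owns items [lo, hi); piece sizes differ by at most one. */
    long block_lo(long n, int p, int id) { return (long)id * n / p; }
    long block_hi(long n, int p, int id) { return (long)(id + 1) * n / p; }

    int main(void)
    {
        long n = 10; int p = 3;
        for (int id = 0; id < p; id++)
            printf("task %d owns [%ld, %ld)\n",
                   id, block_lo(n, p, id), block_hi(n, p, id));
        return 0;   /* prints [0,3), [3,6), [6,10) */
    }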

Functional (task) decomposition

• Primary object of decomposition: the computation
  – first, the computation is decomposed into disjoint tasks
    • different codes for the tasks (MPMD – Multiple Program Multiple Data)
    • methodological benefits: implies program structuring
      – gives rise to simpler modules with clear interfaces
      – cf. object-oriented programming, etc.
  – next, the data is partitioned according to the requirements of the tasks
    • the data requirements may be disjoint, or overlap (→ communication)

• Sources of parallelism:
  – concurrent processing of independent tasks
  – concurrent processing of a stream of data through pipelining (see the sketch below)
    • a stream of data is passed through a succession of tasks, each of which performs some operation on it
    • MPSD – Multiple Program Single Data

• The number of tasks usually does not scale with the problem size
  – for greater scalability, combine with domain decomposition of the subtasks

Climate model [Foster 1995]
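A minimal sketch of the pipelining idea in C with MPI, assuming each rank implements one stage and items flow from rank 0 to rank p−1; the stage() function and the stream length NITEMS are placeholders, not from the slides.

    #include <mpi.h>

    /* Each rank is one pipeline stage (MPSD): items flow
     * rank 0 -> rank 1 -> ... -> rank p-1. stage() is a placeholder. */
    static double stage(double x, int rank) { return x + rank; }

    int main(int argc, char *argv[])
    {
        int rank, p;
        const int NITEMS = 100;              /* assumed stream length */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        for (int i = 0; i < NITEMS; i++) {
            double item = 0.0;
            if (rank == 0)
                item = (double)i;            /* first stage produces the stream */
            else
                MPI_Recv(&item, 1, MPI_DOUBLE, rank - 1, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            item = stage(item, rank);        /* this stage's operation */
            if (rank < p - 1)
                MPI_Send(&item, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
            /* the last stage would consume/store the result here */
        }
        MPI_Finalize();
        return 0;
    }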

Good decomposition

• More tasks (at least by an order of magnitude) than processors
  – if not: little flexibility

• No redundancy in processing and data
  – if not: little scalability

• Comparable size of tasks
  – if not: difficult load balancing

• Number of tasks proportional to the size of the problem
  – if not: problems with utilizing additional processors

• Are alternative partitions available?


Example: PI calculation

• Calculation of π by the standard numerical integration formula

• Consider numerical integration based on the rectangle method

– integral is approximated by the area of evenly spaced rectangular strips

– height of the strips is calculated as the value of the integrated function at the midpoint of the strips


$$\pi = \int_0^1 \frac{4}{1+x^2}\,dx$$

[Figure: plot of F(x) = 4/(1 + x²) on [0.0, 1.0], with F(0) = 4.0 and F(1) = 2.0; the area under the curve is approximated by rectangular strips]

PI calculation – sequential algorithm

Sequential pseudocode:

    set n (number of strips)
    for each strip
        calculate the height y of the strip (rectangle) at its midpoint
        sum all y to the result S
    endfor
    multiply S by the width of the strips
    print result

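The pseudocode translates directly into C; a sketch of the midpoint rule, with the number of strips chosen arbitrarily:

    #include <stdio.h>

    int main(void)
    {
        long n = 1000000;                 /* number of strips (arbitrary choice) */
        double h = 1.0 / n;               /* width of each strip */
        double S = 0.0;
        for (long i = 0; i < n; i++) {
            double x = (i + 0.5) * h;     /* midpoint of strip i */
            S += 4.0 / (1.0 + x * x);     /* height of the strip */
        }
        printf("pi ~ %.15f\n", S * h);    /* multiply by the strip width */
        return 0;
    }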

PI calculation – parallel algorithm

Parallel pseudocode (for the task/channel model):

    if master then
        set n (number of strips)
        send n to the workers
    else // worker
        receive n from the master
    endif
    for each strip assigned to this task
        calculate the height y of the strip (rectangle) at its midpoint
        sum all y to the (partial) result S
    endfor
    if master then
        receive S from all workers
        sum all S and multiply by the width of the strips
        print result
    else // worker
        send S to the master
    endif

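One possible realization of this pseudocode in C with MPI, with explicit point-to-point messages standing in for the channels (collectives such as MPI_Bcast and MPI_Reduce would do the same job more idiomatically); the strip count and the cyclic assignment of strips to tasks are illustrative choices:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, p;
        long n = 1000000;                        /* number of strips */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        if (rank == 0)                           /* master: send n to the workers */
            for (int w = 1; w < p; w++)
                MPI_Send(&n, 1, MPI_LONG, w, 0, MPI_COMM_WORLD);
        else                                     /* worker: receive n from the master */
            MPI_Recv(&n, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        double h = 1.0 / n, S = 0.0;
        for (long i = rank; i < n; i += p) {     /* strips assigned cyclically */
            double x = (i + 0.5) * h;
            S += 4.0 / (1.0 + x * x);            /* height at the midpoint */
        }
        if (rank == 0) {                         /* master: collect partial sums */
            double Sw;
            for (int w = 1; w < p; w++) {
                MPI_Recv(&Sw, 1, MPI_DOUBLE, w, 1, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                S += Sw;
            }
            printf("pi ~ %.15f\n", S * h);
        } else {                                 /* worker: send S to the master */
            MPI_Send(&S, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }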

Parallel PI calculation – partitioning

• Domain decomposition:
  – primitive task: the calculation of one strip height

• Functional decomposition:
  – manager task: controls the computation
  – worker task(s): perform the main calculation
    • manager/worker technique (also called control decomposition)
    • a more or less technical decomposition

• A perfectly/embarrassingly parallel problem: the (worker) processes are (almost) independent


Communication analysis

• Determination of the communication pattern among the primitive tasks

• Goal: Expose the information flow

• The tasks generated by partitioning are as a rule not independent – they cooperate by exchanging data

• Communication means overhead – minimize!

– not included in the sequential algorithm

• Efficient communication may be difficult to organize

– especially in domain-decomposed problems


Parallel communication

Categorization:
• local: only a small number of “neighbours” participate
• global: many “distant” tasks participate
• structured: regular and repeated communication patterns in place and time
• unstructured: communication networks are arbitrary graphs
• static: communication partners do not change over time
• dynamic: communication depends on the computation history and changes at runtime
• synchronous: communication partners cooperate in the data transfer operations
• asynchronous: producers are not able to determine the data requests of consumers

In each pair, the first kind is to be preferred in parallel programs.


Good communication

• Preferably no communication involved in the parallel algorithm
  – if not: overhead decreasing parallel efficiency

• Tasks have comparable communication demands
  – if not: little scalability

• Tasks communicate only with a small number of neighbours
  – if not: loss of parallel efficiency

• Communication operations and computation in different tasks can proceed concurrently (communication and computation can overlap)
  – if not: inefficient and nonscalable algorithm



Example: Jacobi differences

Jacobi finite difference method:
• Repeated update (in timesteps) of values assigned to the points of a multidimensional grid
• In 2-D, the grid point (i, j) may get in timestep t+1 a value given by the formula (a weighted mean):

$$X_{i,j}^{(t+1)} = \frac{4\,X_{i,j}^{(t)} + X_{i-1,j}^{(t)} + X_{i+1,j}^{(t)} + X_{i,j-1}^{(t)} + X_{i,j+1}^{(t)}}{8}$$

[Foster 1995]


Jacobi: parallel algorithm

• Decomposition (domain):
  – primitive task: the calculation of the weighted mean in one grid point

• Parallel code main loop:

    for each timestep t
        send X_{i,j}(t) to each neighbour
        receive X_{i-1,j}(t), X_{i+1,j}(t), X_{i,j-1}(t), X_{i,j+1}(t) from the neighbours
        calculate X_{i,j}(t+1)
    endfor

• Communication:
  – communication channels between neighbours
  – local, structured, static, synchronous
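A sketch of an agglomerated version in C with MPI, one block of rows per rank, with ghost (halo) rows exchanged between neighbours; MPI_Sendrecv keeps the synchronous exchange deadlock-free and MPI_PROC_NULL handles the grid edges. The grid width, rows per rank and iteration count are assumptions for illustration.

    #include <mpi.h>
    #include <string.h>

    #define NC 64                      /* columns (assumed grid width) */

    /* One timestep for a rank owning rows 1..nr of u; rows 0 and nr+1
     * are ghost rows holding the neighbours' boundary values. */
    static void jacobi_step(double u[][NC], double v[][NC], int nr)
    {
        for (int i = 1; i <= nr; i++)
            for (int j = 1; j < NC - 1; j++)
                v[i][j] = (4.0 * u[i][j] + u[i-1][j] + u[i+1][j]
                           + u[i][j-1] + u[i][j+1]) / 8.0;
    }

    int main(int argc, char *argv[])
    {
        int rank, p, nr = 16;                    /* rows per rank (assumed) */
        static double u[18][NC], v[18][NC];      /* nr + 2 ghost rows */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        int up   = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;  /* edges: no-op */
        int down = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;
        for (int t = 0; t < 100; t++) {
            /* send own boundary rows, receive neighbours' into ghost rows */
            MPI_Sendrecv(u[1],    NC, MPI_DOUBLE, up,   0,
                         u[nr+1], NC, MPI_DOUBLE, down, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(u[nr],   NC, MPI_DOUBLE, down, 1,
                         u[0],    NC, MPI_DOUBLE, up,   1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            jacobi_step(u, v, nr);
            memcpy(u, v, sizeof u);              /* X(t) <- X(t+1) */
        }
        MPI_Finalize();
        return 0;
    }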


Example: Gauss-Seidel scheme

$$X_{i,j}^{(t+1)} = \frac{4\,X_{i,j}^{(t)} + X_{i-1,j}^{(t+1)} + X_{i+1,j}^{(t)} + X_{i,j-1}^{(t+1)} + X_{i,j+1}^{(t)}}{8}$$

• More efficient in sequential computing

• Not easy to parallelize

[Foster 1995]

Agglomeration

• Process of grouping primitive tasks into larger tasks

• Goal: revision of the (abstract, conceptual) partitioning and communication to improve performance (and to simplify programming)
  – choose a granularity appropriate to the target parallel computer

• Large numbers of fine-grained tasks tend to be inefficient because of the high
  • communication cost
  • task creation cost
    – the spawn operation is rather expensive

• Agglomeration increases granularity
  – potential conflict with retaining flexibility and scalability [next slides]

• Closely related to the mapping to processors


Agglomeration & granularity

• Granularity: a measure characterizing the size and quantity of tasks

• Increasing granularity by combining several tasks into larger ones
  – reduces communication cost
    • less communication (a)
    • fewer, but larger messages (b)
  – reduces task creation cost
    • fewer processes

• Agglomerate tasks that
  – frequently communicate with each other
    • increases locality
  – cannot execute concurrently anyway

• Consider also [next slides]
  – surface-to-volume effects
  – replication of computation/data


[Quinn 2004]

Surface-to-volume effects (1)

• The communication/computation ratio decreases with increasing granularity:

– computation cost is proportional to the “volume” of the subdomain

– communication cost is proportional to the “surface”

• Agglomeration in all dimensions is most efficient
  – reduces the surface for a given volume
  – in practice more difficult to code

• Difficult with unstructured communication

• Ex.: Jacobi finite differences [next slide]



Surface-to-volume effects (2)

[Foster 1995]

Ex.: Jacobi finite differences – agglomeration

No agglomeration (one grid point per task):
    comp ∝ 1, comm ∝ 4 per task  →  comm/comp = 4

Agglomeration 4 × 4 (16 grid points per task):
    comp ∝ 16, comm ∝ 4 · 4 = 16 per task  →  comm/comp = 1

4 > 1: agglomeration reduces the relative communication cost.
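The same accounting generalizes to square blocks of n × n grid points (standard surface-to-volume reasoning, not from the slide):

$$\frac{T_{\mathrm{comm}}}{T_{\mathrm{comp}}} \propto \frac{4n}{n^2} = \frac{4}{n}$$

so n = 1 (no agglomeration) gives the ratio 4 and n = 4 gives 1, matching the figures above; the ratio keeps falling as the blocks grow.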

Agglomeration & flexibility

• Ability to make use of diverse computing environments
  – good parallel programs are resilient to changes in processor count
    • scalability: the ability to employ an increasing number of tasks

• Too coarse granularity reduces flexibility

• Usual practical design: agglomerate one task per processor
  – can be controlled by compile-time or runtime parameters
  – with some MPS (PVM, MPI-2) on the fly (dynamic spawn)

• But consider also creating more tasks than processors:
  – when tasks often wait for remote data: several tasks mapped to one processor permit overlapping computation and communication
  – greater scope for mapping strategies that balance the computational load over the available processors
    • a rule of thumb: an order of magnitude more tasks than processors

• Optimal number of tasks: determined by a combination of analytic modelling and empirical studies


Replicating computation

• To reduce communication requirements, the same computation is repeated in several tasks
  – compute once & distribute vs. compute repeatedly & don’t communicate: a trade-off

• Redundant computation pays off when its computational cost is less than the cost of the communication it replaces
  – moreover, it removes dependences

• Ex.: summation of n numbers (located on separate processors) with distribution of the result


[Figure: tree summation (s) followed by distribution (d) of the result, vs. replicated summation (s) performed in every task]

Without replication: 2(n − 1) steps
  • n − 1 additions (the necessary minimum)

With replication: n − 1 steps
  • n(n − 1) additions, (n − 1)² of them redundant
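A sketch of the replicated variant in C with MPI: the n values circulate around a ring and every rank accumulates the complete sum itself in n − 1 steps, so no distribution phase is needed. The per-rank value is illustrative; in practice the whole pattern is a single MPI_Allreduce call.

    #include <mpi.h>
    #include <stdio.h>

    /* Ring summation with replication: each of the p ranks accumulates
     * the complete sum itself in p - 1 steps (p(p-1) additions in total,
     * (p-1)^2 of them redundant). */
    int main(int argc, char *argv[])
    {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        double mine = rank + 1.0;                 /* this rank's local number */
        double sum = mine, pass = mine, in;
        int left = (rank + p - 1) % p, right = (rank + 1) % p;
        for (int step = 0; step < p - 1; step++) {
            /* forward the value received last step, take one from the left */
            MPI_Sendrecv(&pass, 1, MPI_DOUBLE, right, 0,
                         &in,   1, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += in;                            /* replicated addition */
            pass = in;
        }
        printf("rank %d: sum = %g\n", rank, sum); /* every rank: p(p+1)/2 */
        /* the library equivalent of the whole pattern:
           MPI_Allreduce(&mine, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD); */
        MPI_Finalize();
        return 0;
    }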

Good agglomeration

• Increased locality of communication

• Beneficial replication of computation

• Replication of data does not compromise scalability

• Similar computation and communication costs of the agglomerated tasks

• Number of tasks can scale with the problem size

• Fewer larger-grained tasks are usually more efficient than many fine-grained tasks


Mapping

• Process of assigning (agglomerated) tasks to processors for execution

• Goal: maximize processor utilization, minimize interprocessor communication – two conflicting requirements
  – load balancing

• Concerns multicomputers only
  – multiprocessors: automatic task scheduling

• Guidelines to minimize the execution time:
  – place concurrent tasks on different processors (increases concurrency)
  – place tasks that communicate frequently on the same processor (enhances locality)

• Optimal mapping is in general an NP-complete problem
  – strategies and heuristics for special classes of problems are available



Basic mapping strategies

[Figure: overview of the basic mapping strategies, after Quinn 2004]

Load balancing

• Mapping strategy with the aim of keeping all processors busy during the execution of the parallel program
  – minimization of the idle time

• In a heterogeneous computing environment, every parallel application may need (dynamic) load balancing

• Static load balancing
  – performed before the program enters the solution phase

• Dynamic load balancing
  – needed when tasks are created/destroyed at runtime and/or the communication/computation requirements of the tasks vary widely
  – invoked occasionally during the execution of the parallel program
    • analyses the current computation and rebalances it
    • may imply significant overhead!


[Figure: bad load balancing – processors idle while waiting at a barrier; after LLNL 2010]

Load-balancing algorithms

• Most appropriate for domain decomposed problems

• Representative examples [next slides]

– recursive bisection

– probabilistic methods

– local algorithms


Recursive bisection

• Recursive cuts into subdomains of nearly equal computational cost while attempting to minimize communication

– allows the partitioning algorithm itself to be executed in parallel


Coordinate bisection:
• for irregular grids with local communication
• cuts into halves based on the physical coordinates of the grid points
• simple, but does not take communication into account
• unbalanced bisection: does not necessarily divide into halves
  – to reduce communication
• a lot of variants
  – e.g. recursive graph bisection

Irregular grid for a superconductivity simulation [Foster 1995]

Probabilistic methods

• Allocate tasks randomly to processors
  – about the same computational load can be expected for a large number of tasks
    • typically at least ten times as many tasks as processors are required

• Communication is usually not considered
  – appropriate for tasks with little communication and/or little locality in communication

• Simple, low cost, scalable

• Variant: cyclic mapping, for spatial locality in load levels
  – each of the p processors is allocated every p-th task (see the sketch below)

• Variant: block-cyclic distributions
  – blocks of tasks are allocated to the processors


[Figure: the strips of F(x) = 4/(1 + x²) from the PI example allocated cyclically to processors #1, #2, #3]
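Both variants reduce to simple index arithmetic; a sketch in C (the strips in the figure above correspond to the cyclic form, as used in the parallel PI code earlier). The function names are hypothetical.

    #include <stdio.h>

    /* Owner of task i under the two mapping variants (p processors). */
    int cyclic_owner(long i, int p)              { return (int)(i % p); }
    int block_cyclic_owner(long i, int p, int b) { return (int)((i / b) % p); }

    int main(void)
    {
        for (long i = 0; i < 8; i++)   /* p = 3, block size b = 2 */
            printf("task %ld: cyclic -> proc %d, block-cyclic -> proc %d\n",
                   i, cyclic_owner(i, 3), block_cyclic_owner(i, 3, 2));
        return 0;
    }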

Local algorithms

• Compensate for changes in computational load using only local information obtained from a small number of “neighbouring” tasks
  – do not require expensive global knowledge of the computational state

• If an imbalance exists (exceeds a threshold), some computational load is transferred to the less loaded neighbour

• Simple, but less efficient than global algorithms
  – slow in adjusting to major changes in the load characteristics

• Advantageous for dynamic load balancing


Local algorithm for a grid problem [Foster 1995]

Task-scheduling algorithms

• Suitable for a pool of independent tasks
  – the tasks represent stand-alone problems and contain solution code + data
  – they can be conceived of as a special kind of data

• Often obtained from functional decomposition
  – many tasks with weak locality

• Centralized or distributed variants

• Dynamic load balancing by default

• Examples:
  – (hierarchical) manager/worker
  – decentralized schemes


Manager/worker

• Simple task-scheduling scheme (see the sketch below)
  – sometimes called “master/slave”

• A central manager task is responsible for problem allocation
  – maintains a pool (queue) of problems
    • e.g. a search in a particular tree branch

• Workers run on separate processors and repeatedly request and solve assigned problems
  – may also send new problems to the manager

• Efficiency:
  – consider the cost of problem transfers
    • prefetching and caching are applicable
  – the manager must not become a bottleneck

• Hierarchical manager/worker variant
  – introduces a layer of submanagers, each responsible for a subset of workers


[Wilkinson 1999]
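A minimal manager/worker skeleton in C with MPI, under the simplifying assumptions that a “problem” is a single integer, solve() is a placeholder, and results are merely returned with the next request; the message tags are illustrative.

    #include <mpi.h>

    #define TAG_REQ  1   /* worker -> manager: request work (carries last result) */
    #define TAG_WORK 2   /* manager -> worker: a problem to solve */
    #define TAG_STOP 3   /* manager -> worker: pool is empty, terminate */

    static int solve(int problem) { return problem * problem; }  /* placeholder */

    int main(int argc, char *argv[])
    {
        int rank, p;
        const int NPROBLEMS = 20;                  /* assumed pool size */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        if (rank == 0) {                           /* manager */
            int next = 0, stopped = 0, result;
            MPI_Status st;
            while (stopped < p - 1) {
                /* wait for any idle worker; its previous result arrives too */
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQ,
                         MPI_COMM_WORLD, &st);
                if (next < NPROBLEMS) {            /* hand out the next problem */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    next++;
                } else {                           /* pool empty: dismiss worker */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                    stopped++;
                }
            }
        } else {                                   /* worker */
            int problem, result = 0;
            MPI_Status st;
            for (;;) {
                MPI_Send(&result, 1, MPI_INT, 0, TAG_REQ, MPI_COMM_WORLD);
                MPI_Recv(&problem, 1, MPI_INT, 0, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                result = solve(problem);
            }
        }
        MPI_Finalize();
        return 0;
    }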

Decentralized schemes

• Task-scheduling without global management

• Task pool is a data structure distributed among many processors

• The pool is accessed asynchronously by idle workers
  – various access policies: neighbours, at random, etc.

• Termination detection may be difficult


Good mapping

• In general: Try to balance conflicting requirements for equitable load distribution and low communication cost

• When possible, use static mapping allocating each process to a single processor

• Dynamic load balancing / task scheduling can be appropriate when the number or size of tasks is variable or not known until runtime

• With centralized load-balancing schemes verify that the manager will not become a bottleneck

• Consider implementation cost


Conclusions

• Foster’s design methodology is conveniently applicable
  – in [Quinn 2004] it is used in the design of many parallel programs in MPI (and OpenMP)

• In practice, all phases are often considered in parallel

• In bad practice, the conceptual phases are skipped
  – machine-dependent design from the very beginning

• Serves as a kind of “life-belt” (“fixed point”) when the development gets into trouble



Further study

• [Foster 1995] Designing and Building Parallel Programs

• [Quinn 2004] Parallel Programming in C with MPI and OpenMP

• In most textbooks a chapter like “Principles of parallel algorithm design”

– often concentrated on the mapping step


Example: tree search