Parallel Programming
Sathish S. Vadhiyar
Course Web Page: http://www.serc.iisc.ernet.in/~vss/courses/PPP2009
Motivation for Parallel Programming
• Faster execution time, due to non-dependencies between regions of code
• Presents a level of modularity
• Resource constraints, e.g. large databases
• Certain classes of algorithms lend themselves naturally to parallel formulation
• Aggregate bandwidth to memory/disk; increase in data throughput
• Clock rate improvement in the past decade – 40%
• Memory access time improvement in the past decade – 10%
• Grand challenge problems (more later)
Challenges / Problems in Parallel Algorithms
Building efficient algorithms means avoiding:
• Communication delay
• Idling
• Synchronization
Challenges
[Figure: timeline for two processors P0 and P1, showing periods of computation, communication, synchronization, and idle time]
Metrics – How do we evaluate a parallel program?
• Execution time, Tp
• Speedup, S
  - S(p, n) = T(1, n) / T(p, n)
  - Usually S(p, n) < p; sometimes S(p, n) > p (superlinear speedup)
• Efficiency, E
  - E(p, n) = S(p, n) / p
  - Usually E(p, n) < 1; sometimes greater than 1
• Scalability – limitations in parallel computing, and their relation to n and p
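For example (illustrative numbers): if a program takes T(1, n) = 100 s on one processor and T(8, n) = 16 s on 8 processors, then S(8, n) = 100/16 = 6.25 and E(8, n) = 6.25/8 ≈ 0.78.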
Speedups and efficiency
[Figures: speedup S vs. number of processors p (ideal linear vs. practical), and efficiency E vs. p]
Limitations on speedup – Amdahl’s law
• Amdahl’s law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
• Overall speedup is expressed in terms of the fractions of computation time with and without the enhancement, and the speedup of the enhanced portion.
• Places a limit on the speedup due to parallelism.
• Speedup = 1 / (fs + fp/P), where fs and fp are the serial and parallel fractions (fs + fp = 1) and P is the number of processors.
Amdahl’s law Illustration
Courtesy:
http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html
http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm
[Plot: efficiency (0 to 1) as a function of the number of processors (0 to 15), illustrating Amdahl’s law]
S = 1 / (s + (1-s)/p), where s is the serial fraction of the computation
Amdahl’s law analysis
f P=1 P=4 P=8 P=16 P=32
1.00 1.0 4.00 8.00 16.00 32.00
0.99 1.0 3.88 7.48 13.91 24.43
0.98 1.0 3.77 7.02 12.31 19.75
0.96 1.0 3.57 6.25 10.00 14.29
• For the same fraction, the speedup numbers fall further and further behind the processor count as P grows.
• Thus Amdahl’s law is a bit depressing for parallel programming.
• In practice, the parallel portion of the work has to be a large enough fraction to match a given number of processors.
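The table follows directly from the formula; a small C sketch (illustrative only, not course material) that reproduces it:

#include <stdio.h>

/* Amdahl's law: speedup for parallel fraction f on P processors. */
static double amdahl(double f, int P) {
    return 1.0 / ((1.0 - f) + f / P);
}

int main(void) {
    const double f[] = { 1.00, 0.99, 0.98, 0.96 };
    const int    P[] = { 1, 4, 8, 16, 32 };
    for (int i = 0; i < 4; i++) {
        printf("f=%.2f:", f[i]);
        for (int j = 0; j < 5; j++)
            printf(" %6.2f", amdahl(f[i], P[j]));
        printf("\n");
    }
    return 0;
}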
Gustafson’s Law
• Amdahl’s law – keep the parallel work fixed.
• Gustafson’s law – keep the computation time on the parallel processors fixed, and change the problem size (the ratio of parallel to sequential work) to match that computation time.
• For a particular number of processors, find the problem size for which the parallel time equals the constant time.
• For that problem size, find the sequential time and the corresponding speedup.
• This speedup is therefore called scaled speedup.
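In its commonly quoted form, with s the serial fraction of the time spent on the parallel machine, the scaled speedup is S(P) = s + (1 - s)·P = P - s·(P - 1). For example (illustrative numbers), with s = 0.05 and P = 32, S ≈ 0.05 + 0.95 × 32 = 30.45, far more optimistic than the roughly 12.5 that Amdahl’s law gives for the same fractions.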
Metrics (Contd..)
N P=1 P=4 P=8 P=16 P=32
64 1.0 0.80 0.57 0.33
192 1.0 0.92 0.80 0.60
512 1.0 0.97 0.91 0.80
Table 5.1: Efficiency as a function of n and p.
Scalability
• Efficiency decreases with increasing P; increases with increasing N.
• Scalability: how effectively the parallel algorithm can use an increasing number of processors.
• How the amount of computation performed must scale with P to keep E constant.
• This function of computation in terms of P is called the isoefficiency function.
• An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable.
Scalability Analysis – Finite Difference algorithm with 1D decomposition
• For constant efficiency, a function of P, when substituted for N, must keep E constant as P increases.
• Hence the isoefficiency function = O(P^2), since computation is O(N^2).
• Can be satisfied with N = P, except for small P.
Scalability Analysis – Finite Difference algorithm with 2D decomposition
• Hence the isoefficiency function = O(P).
• Can be satisfied with N = sqrt(P).
• The 2D algorithm is therefore more scalable than the 1D one.
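A rough way to see both results (a sketch assuming an N × N grid, nearest-neighbour updates, and ignoring constant factors): with a 1D decomposition each of the P processors computes N²/P points per iteration but exchanges whole rows of N points, so E ≈ N² / (N² + P·N); keeping E constant forces N to grow in proportion to P, so the total computation N² grows as P². With a 2D decomposition each processor exchanges only O(N/sqrt(P)) points, so E ≈ N² / (N² + N·sqrt(P)); N need only grow as sqrt(P), and the total computation grows as P.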
Parallel Algorithm Design
Steps
• Decomposition – splitting the problem into tasks or modules
• Mapping – assigning tasks to processors
Mapping has contradictory objectives:
• To minimize idle times
• To reduce communication
Mapping
Static mapping
• Mapping based on data partitioning
  - Applicable to dense matrix computations
  - Block distribution
  - Block-cyclic distribution
• Graph-partitioning-based mapping
  - Applicable to sparse matrix computations
• Mapping based on task partitioning
Example distributions of 9 elements over processors 0, 1, 2:
  Block:  0 0 0 1 1 1 2 2 2
  Cyclic: 0 1 2 0 1 2 0 1 2
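A minimal C sketch (illustrative helper functions, not from the course) of how an array index maps to a processor under these distributions:

/* Which of the p processors owns array element i? */

/* Block distribution: contiguous chunks of ceil(n/p) elements. */
int block_owner(int i, int n, int p) {
    int chunk = (n + p - 1) / p;
    return i / chunk;
}

/* Block-cyclic distribution: blocks of b elements dealt out round-robin;
   b = 1 gives the plain cyclic distribution shown above. */
int block_cyclic_owner(int i, int b, int p) {
    return (i / b) % p;
}

For n = 9 and p = 3 these reproduce the rows above: block_owner gives 0 0 0 1 1 1 2 2 2, and block_cyclic_owner with b = 1 gives 0 1 2 0 1 2 0 1 2.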
Based on Task Partitioning
• Based on the task dependency graph
• In general the mapping problem is NP-complete
[Figure: a binary task dependency tree mapped onto 8 processors – the root on processor 0; the next level on 0 and 4; then 0, 2, 4, 6; and the leaves on processors 0–7]
Mapping
Dynamic Mapping
• A process / global memory can hold a set of tasks
• Distribute some tasks to all processes
• Once a process completes its tasks, it asks the coordinator process for more tasks
• Referred to as self-scheduling or work-stealing
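On shared memory the same idea is available directly from OpenMP’s dynamic loop schedule; a minimal sketch (process_task is a hypothetical placeholder for the per-task work):

#include <omp.h>

void process_task(int t);   /* hypothetical per-task work, defined elsewhere */

void run_tasks(int ntasks) {
    /* Each thread grabs 4 tasks at a time from a shared pool and,
       when it finishes them, asks the runtime for more (self-scheduling). */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int t = 0; t < ntasks; t++)
        process_task(t);
}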
Interaction Overheads
• In spite of the best efforts in mapping, there can be interaction overheads
• Due to frequent communications, exchanging large volumes of data, interaction with the farthest processors, etc.
• Some techniques can be used to minimize interactions
Parallel Algorithm Design – Containing Interaction Overheads
• Maximizing data locality
• Minimizing the volume of data exchange
  - Using higher-dimensional mappings
  - Not communicating intermediate results
• Minimizing the frequency of interactions
• Minimizing contention and hot spots
  - Avoid having every process use the same communication pattern with the other processes at the same time
Parallel Algorithm Design – Containing Interaction Overheads
• Overlapping computations with interactions
  - Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2)
  - Initiate the communication for type 1; during the communication, perform the type 2 computations (see the sketch after this list)
• Overlapping interactions with interactions
• Replicating data or computations
  - Balancing the extra computation or storage cost against the gain from reduced communication
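A minimal MPI sketch of this overlap (the buffer layout and the compute_* function pointers are illustrative assumptions, not course code):

#include <mpi.h>

/* halo holds 2*n doubles: the first n are received from the neighbour nbr,
   the next n are the values sent to it. */
void exchange_and_compute(double *halo, int n, int nbr,
                          void (*compute_type2)(void),   /* independent work */
                          void (*compute_type1)(void))   /* needs halo data  */
{
    MPI_Request req[2];

    /* Initiate the communication needed by the type-1 computations */
    MPI_Irecv(halo,     n, MPI_DOUBLE, nbr, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(halo + n, n, MPI_DOUBLE, nbr, 0, MPI_COMM_WORLD, &req[1]);

    /* While the messages are in flight, perform the type-2 computations */
    compute_type2();

    /* Wait for the data, then perform the type-1 computations */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    compute_type1();
}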
Parallel Algorithm Classification – Types – Models
Parallel Algorithm Types
• Divide and conquer
• Data partitioning / decomposition
• Pipelining
Divide-and-Conquer
• Recursive in structure
• Divide the problem into sub-problems that are similar to the original but smaller in size
• Conquer the sub-problems by solving them recursively; if small enough, solve them in a straightforward manner
• Combine the solutions to create a solution to the original problem
Divide-and-Conquer Example: Merge Sort
• Problem: sort a sequence of n elements
• Divide the sequence into two subsequences of n/2 elements each
• Conquer: sort the two subsequences recursively using merge sort
• Combine: merge the two sorted subsequences to produce the sorted answer
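A compact serial C sketch of the three steps (illustrative; a parallel variant would hand the two recursive calls to different processors or threads):

#include <string.h>

/* Sort a[lo..hi) with the help of a scratch buffer tmp of the same size. */
void merge_sort(int *a, int *tmp, int lo, int hi) {
    if (hi - lo < 2) return;                      /* small enough: done */
    int mid = lo + (hi - lo) / 2;
    merge_sort(a, tmp, lo, mid);                  /* conquer left half  */
    merge_sort(a, tmp, mid, hi);                  /* conquer right half */
    int i = lo, j = mid, k = lo;                  /* combine: merge     */
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof(int));
}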
Partitioning
1. Breaking up the given problem into p independent subproblems of almost equal sizes
2. Solving the p subproblems concurrently
• Mostly splitting the input or output into non-overlapping pieces
• Example: matrix multiplication – either the inputs (A or B) or the output (C) can be partitioned.
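An illustrative C sketch of output partitioning (hypothetical function, row-major square matrices assumed): worker rank out of p computes a contiguous block of rows of C = A × B.

/* Worker `rank` (0..p-1) computes rows [lo, hi) of C = A * B (n x n). */
void matmul_rows(const double *A, const double *B, double *C,
                 int n, int rank, int p) {
    int chunk = (n + p - 1) / p;
    int lo = rank * chunk;
    int hi = (lo + chunk < n) ? lo + chunk : n;
    for (int i = lo; i < hi; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}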
Pipelining
Occurs with image processing applications, where a number of images undergo a sequence of transformations.
Parallel Algorithm Models
• Data parallel model – processes perform identical tasks on different data
• Task parallel model – different processes perform different tasks on the same or different data, based on a task dependency graph
• Work pool model – any task can be performed by any process; tasks are added to a work pool dynamically
• Pipeline model – a stream of data passes through a chain of processes (stream parallelism)
Parallel Program Classification – Models – Structure – Paradigms
Parallel Program Models
• Single Program Multiple Data (SPMD)
• Multiple Program Multiple Data (MPMD)
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Parallel Program Structure Types
• Master-Worker / parameter sweep / task farming
• Embarrassingly / pleasingly parallel
• Pipeline / systolic / wavefront
• Tightly-coupled
• Workflow
[Figures: a master process P0 coordinating workers P1–P4, and a pipeline of processes P0 through P4]
Programming Paradigms
• Shared memory model – Threads, OpenMP
• Message passing model – MPI
• Data parallel model – HPF
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Parallel Architectures – Classification – Cache coherence in shared memory platforms – Interconnection networks
Classification of Architectures – Flynn’s classification
• Single Instruction Single Data (SISD): serial computers
• Single Instruction Multiple Data (SIMD)
  - Vector processors and processor arrays
  - Examples: CM-2, Cray C90, Cray Y-MP, Hitachi 3600
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Flynn’s classification
• Multiple Instruction Single Data (MISD): not popular
• Multiple Instruction Multiple Data (MIMD)
  - Most popular – IBM SP and most other supercomputers, clusters, computational Grids, etc.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Based on Memory
• Shared memory – 2 types: UMA and NUMA
[Diagrams: UMA and NUMA shared-memory organizations]
Examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Based on Memory
• Distributed memory
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
• Recently, multi-cores
• Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids
Cache Coherence – for details, read Section 2.4.6 of the book
Interconnection networks – for details, read Sections 2.4.2–2.4.5 of the book
Cache Coherence in SMPs
[Figure: main memory and four CPUs (CPU0–CPU3), each with its own cache (cache0–cache3) holding a copy of cache line ‘a’]
• All processors read a variable ‘x’ residing in cache line ‘a’
• Each processor then updates ‘x’ at different points in time
• Challenge: to maintain a consistent view of the data
• Protocols:
  - Write update
  - Write invalidate
Cache Coherence Protocols and Implementations
• Write update – propagate the cache line to the other processors on every write by a processor
• Write invalidate – each processor gets the updated cache line whenever it reads stale data
• Which is better??
Caches – False Sharing
[Figure: main memory holding cache lines A0–A8 and A9–A15; CPU0 (cache0) updates A0, A2, A4, … while CPU1 (cache1) updates A1, A3, A5, …, so both CPUs touch the same cache lines]
• Different processors update different parts of the same cache line
• Leads to ping-pong of cache lines between processors
• Remedy: modify the algorithm to change the stride
• The situation is better with update protocols than with invalidate protocols. Why?
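A minimal OpenMP sketch of the usual fix (illustrative only): give each thread its own padded accumulator so that no two threads write into the same cache line.

#include <omp.h>

#define NTHREADS 4
#define PAD 8   /* 8 doubles = 64 bytes, one cache line assumed */

/* partial[t][0] is thread t's accumulator; the padding keeps each
   accumulator on its own cache line.  With a plain double partial[NTHREADS],
   the accumulators would share a line and every update would invalidate
   the other threads' copies (false sharing). */
static double partial[NTHREADS][PAD];

double sum(const double *a, int n) {
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        partial[t][0] = 0.0;
        #pragma omp for
        for (int i = 0; i < n; i++)
            partial[t][0] += a[i];
    }
    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) total += partial[t][0];
    return total;
}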
Cache Coherence using invalidate protocols
• 3 states associated with data items:
  - Shared – a variable shared by 2 caches
  - Invalid – another processor (say P0) has updated the data item
  - Dirty – state of the data item in P0
• Implementations
  - Snoopy: for bus-based architectures; memory operations are propagated over the bus and snooped by all caches
  - Directory-based: instead of broadcasting memory operations to all processors, coherence operations are propagated only to the relevant processors; a central directory maintains the states of cache blocks and the associated processors, implemented with presence bits
Interconnection Networks
• An interconnection network is defined by switches, links and interfaces
  - Switches – provide mapping between input and output ports, buffering, routing, etc.
  - Interfaces – connect nodes with the network
• Network topologies
  - Static – point-to-point communication links among processing nodes
  - Dynamic – communication links are formed dynamically by switches
Interconnection Networks
Static:
• Bus – SGI Challenge
• Completely connected
• Star
• Linear array, Ring (1-D torus)
• Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
• k-d mesh: d dimensions with k nodes in each dimension
• Hypercubes – a log p-dimensional mesh with 2 nodes in each dimension – e.g. many MIMD machines
• Trees – our campus network
Dynamic – communication links are formed dynamically by switches:
• Crossbar – Cray X series – non-blocking network
• Multistage – SP2 – blocking network
For more details, and an evaluation of topologies, refer to the book.
Evaluating Interconnection Topologies
• Diameter – maximum distance between any two processing nodes
  - Fully-connected – 1
  - Star – 2
  - Ring – p/2
  - Hypercube – log P
• Connectivity – multiplicity of paths between 2 nodes; the minimum number of arcs to be removed from the network to break it into two disconnected networks
  - Linear array – 1
  - Ring – 2
  - 2-D mesh – 2
  - 2-D mesh with wraparound – 4
  - d-dimensional hypercube – d
Evaluating Interconnection Topologies
• Bisection width – minimum number of links to be removed from the network to partition it into 2 equal halves
  - Ring – 2
  - P-node 2-D mesh – sqrt(P)
  - Tree – 1
  - Star – 1
  - Completely connected – P^2/4
  - Hypercube – P/2
Evaluating Interconnection Topologies
• Channel width – number of bits that can be communicated simultaneously over a link, i.e. the number of physical wires between 2 nodes
• Channel rate – performance of a single physical wire
• Channel bandwidth – channel rate times channel width
• Bisection bandwidth – minimum volume of communication allowed between any two halves of the network, i.e. bisection width times channel bandwidth
END