Parallel Programming


Transcript of Parallel Programming

Page 1: Parallel Programming

Parallel Programming

Sathish S. Vadhiyar
Course Web Page: http://www.serc.iisc.ernet.in/~vss/courses/PPP2009

Page 2: Parallel Programming

Motivation for Parallel Programming

• Faster execution time due to non-dependencies between regions of code
• Presents a level of modularity
• Resource constraints; large databases
• Certain classes of algorithms lend themselves to parallelism
• Aggregate bandwidth to memory/disk; increase in data throughput
• Clock rate improvement in the past decade – 40%
• Memory access time improvement in the past decade – 10%
• Grand challenge problems (more later)

Page 3: Parallel Programming

Challenges / Problems in Parallel Algorithms

Building efficient algorithms means avoiding:
• Communication delay
• Idling
• Synchronization

Page 4: Parallel Programming

Challenges

(Figure: execution timeline of two processes P0 and P1, showing computation, communication, synchronization, and idle time.)

Page 5: Parallel Programming

How do we evaluate a parallel program?
• Execution time, Tp
• Speedup, S: S(p, n) = T(1, n) / T(p, n); usually S(p, n) < p, sometimes S(p, n) > p (superlinear speedup)
• Efficiency, E: E(p, n) = S(p, n) / p; usually E(p, n) < 1, sometimes greater than 1
• Scalability – limitations in parallel computing, and their relation to n and p
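
To make the definitions concrete, here is a minimal sketch (not part of the slides) that turns measured execution times into speedup and efficiency; the timings and processor count are illustrative assumptions.

```c
#include <stdio.h>

int main(void) {
    double t1 = 120.0;  /* T(1, n): hypothetical serial execution time (s)   */
    double tp = 18.5;   /* T(p, n): hypothetical parallel execution time (s) */
    int    p  = 8;      /* number of processors */

    double speedup    = t1 / tp;      /* S(p, n) = T(1, n) / T(p, n) */
    double efficiency = speedup / p;  /* E(p, n) = S(p, n) / p       */

    printf("S(%d, n) = %.2f  E(%d, n) = %.2f\n", p, speedup, p, efficiency);
    return 0;
}
```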

Page 6: Parallel Programming

Speedups and efficiency

(Figure: speedup S and efficiency E versus the number of processors p, contrasting the ideal and practical curves.)

Page 7: Parallel Programming

Limitations on speedup – Amdahl’s law
• Amdahl’s law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
• The overall speedup is expressed in terms of the fractions of computation time with and without the enhancement, and the improvement gained by the enhancement.
• It places a limit on the speedup due to parallelism.

Speedup = 1 / (fs + fp/P)

where fs is the serial fraction, fp = 1 − fs is the parallel fraction, and P is the number of processors.
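
A small sketch of the formula (the helper name and example fractions are my own); with fs = 0.01 it reproduces the f = 0.99 row of the table on the next slide.

```c
#include <stdio.h>

/* Amdahl's law: S = 1 / (fs + fp/P), with fp = 1 - fs. */
double amdahl_speedup(double fs, int p) {
    return 1.0 / (fs + (1.0 - fs) / p);
}

int main(void) {
    int procs[] = {1, 4, 8, 16, 32};
    for (int i = 0; i < 5; i++)
        printf("fs = 0.01, P = %2d: S = %5.2f\n",
               procs[i], amdahl_speedup(0.01, procs[i]));
    return 0;
}
```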

Page 8: Parallel Programming

Amdahl’s law Illustration

Courtesy:

http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html

http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm

(Figure: speedup/efficiency versus the number of processors, following S = 1 / (s + (1 − s)/p), where s is the serial fraction.)

Page 9: Parallel Programming

Amdahl’s law analysis

Speedup as a function of the parallel fraction f and the number of processors P:

f      P=1   P=4    P=8    P=16    P=32
1.00   1.0   4.00   8.00   16.00   32.00
0.99   1.0   3.88   7.48   13.91   24.43
0.98   1.0   3.77   7.02   12.31   19.75
0.96   1.0   3.57   6.25   10.00   14.29

• For the same fraction, the speedup falls further behind the processor count as P grows.
• Thus Amdahl’s law is a bit depressing for parallel programming.
• In practice, the number of parallel portions of work has to be large enough to match a given number of processors.

Page 10: Parallel Programming

Gustafson’s Law
• Amdahl’s law – keep the parallel work fixed
• Gustafson’s law – keep the computation time on the parallel processors fixed, and change the problem size (the fraction of parallel/sequential work) to match that computation time
• For a particular number of processors, find the problem size for which the parallel time equals the constant time
• For that problem size, find the sequential time and the corresponding speedup
• The speedup is thus scaled to the problem size – hence “scaled speedup”
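
A minimal sketch of the scaled-speedup calculation, using the common Gustafson–Barsis form S = P − s·(P − 1); the serial fraction s = 0.05 is an illustrative assumption, not a value from the course.

```c
#include <stdio.h>

/* Gustafson's scaled speedup: S = P - s*(P - 1),
   where s is the serial fraction measured on the (scaled) parallel run. */
double scaled_speedup(double s, int p) {
    return p - s * (p - 1);
}

int main(void) {
    for (int p = 1; p <= 32; p *= 2)
        printf("s = 0.05, P = %2d: scaled speedup = %.2f\n",
               p, scaled_speedup(0.05, p));
    return 0;
}
```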

Page 11: Parallel Programming

Metrics (contd.)

N      P=1   P=4    P=8    P=16   P=32
64     1.0   0.80   0.57   0.33
192    1.0   0.92   0.80   0.60
512    1.0   0.97   0.91   0.80

Table 5.1: Efficiency as a function of n and p.

Page 12: Parallel Programming

Scalability
• Efficiency decreases with increasing P; it increases with increasing N
• Scalability: how effectively the parallel algorithm can use an increasing number of processors
• How the amount of computation performed must scale with P to keep E constant – this function of computation in terms of P is called the isoefficiency function
• An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable

Page 13: Parallel Programming

Scalability Analysis – Finite Difference algorithm with 1-D decomposition
• For constant efficiency, a function of P, when substituted for N, must satisfy the isoefficiency relation for increasing P and constant E
• Hence the isoefficiency function = O(P²), since the computation is O(N²)
• This can be satisfied with N = P, except for small P

Page 14: Parallel Programming

Scalability Analysis – Finite Difference algorithm with 2-D decomposition
• Hence the isoefficiency function = O(P)
• This can be satisfied with N = √P
• The 2-D algorithm is more scalable than the 1-D algorithm
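
A rough cost-model sketch (my own illustration, not code from the slides) comparing the parallel efficiency of a finite-difference sweep on an N × N grid under 1-D and 2-D decompositions; the per-point compute time and per-word communication time are invented machine parameters.

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double t_calc = 1.0;    /* hypothetical cost of one grid-point update  */
    double t_word = 10.0;   /* hypothetical cost of communicating one word */
    int N = 512;

    for (int P = 4; P <= 256; P *= 4) {
        double comp   = (double)N * N / P * t_calc;           /* work per process         */
        double comm1d = 2.0 * N * t_word;                     /* 1-D: two rows of N words */
        double comm2d = 4.0 * (N / sqrt((double)P)) * t_word; /* 2-D: four block edges    */

        printf("P = %3d: E(1-D) = %.2f  E(2-D) = %.2f\n",
               P, comp / (comp + comm1d), comp / (comp + comm2d));
    }
    return 0;
}
```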

Page 15: Parallel Programming

Parallel Algorithm Design

Page 16: Parallel Programming

Steps

• Decomposition – splitting the problem into tasks or modules
• Mapping – assigning tasks to processors
• Mapping’s contradictory objectives:
  – to minimize idle times
  – to reduce communications

Page 17: Parallel Programming

Mapping

Static mapping
• Mapping based on data partitioning
  – Applicable to dense matrix computations
  – Block distribution
  – Block-cyclic distribution
• Graph-partitioning-based mapping
  – Applicable to sparse matrix computations
• Mapping based on task partitioning

(Figure: block and cyclic distributions of an array over processes 0, 1, and 2 – illustrated in the sketch below.)
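
As a concrete illustration of the two data distributions (the function names and conventions are my own, not from the slides), the following sketch prints which process owns each element of a 9-element array over 3 processes under a block and a cyclic distribution.

```c
#include <stdio.h>

/* Owner of element i under a block distribution of n elements over p processes. */
int block_owner(int i, int n, int p) {
    int block = (n + p - 1) / p;   /* ceiling(n / p) elements per process */
    return i / block;
}

/* Owner of element i under a block-cyclic distribution with block size b. */
int block_cyclic_owner(int i, int b, int p) {
    return (i / b) % p;
}

int main(void) {
    int n = 9, p = 3, b = 1;       /* b = 1 gives a purely cyclic distribution */
    for (int i = 0; i < n; i++)
        printf("element %d: block -> P%d, cyclic -> P%d\n",
               i, block_owner(i, n, p), block_cyclic_owner(i, b, p));
    return 0;
}
```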

Page 18: Parallel Programming

Based on Task Partitioning
• Based on the task dependency graph
• In general, the problem is NP-complete

(Figure: a binary task-dependency tree mapped onto processes 0–7.)

Page 19: Parallel Programming

Mapping

Dynamic mapping
• A process/global memory can hold a set of tasks
• Distribute some tasks to all processes
• Once a process completes its tasks, it asks the coordinator process for more tasks
• Referred to as self-scheduling or work-stealing (see the sketch below)
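
One way to see self-scheduling in practice is OpenMP's dynamic schedule, where idle threads pull the next chunk of iterations from a shared pool; the task body and chunk size below are placeholders of my own.

```c
#include <omp.h>
#include <stdio.h>

void process_task(int t) {
    (void)t;   /* placeholder for irregular, task-specific work */
}

int main(void) {
    int ntasks = 1000;

    /* schedule(dynamic, 4): each thread grabs 4 tasks at a time and asks
       for more when it finishes, balancing load when task costs vary. */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int t = 0; t < ntasks; t++)
        process_task(t);

    printf("all %d tasks processed\n", ntasks);
    return 0;
}
```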

Page 20: Parallel Programming

Interaction Overheads
• In spite of the best efforts in mapping, there can be interaction overheads
• These arise from frequent communications, exchanging large volumes of data, interaction with the farthest processors, etc.
• Some techniques can be used to minimize interactions

Page 21: Parallel Programming

Parallel Algorithm Design - Containing Interaction Overheads

• Maximizing data locality
• Minimizing the volume of data exchange
  – Using higher-dimensional mapping
  – Not communicating intermediate results
• Minimizing the frequency of interactions
• Minimizing contention and hot spots
  – Do not use the same communication pattern with the other processes in all the processes

Page 22: Parallel Programming

Parallel Algorithm Design - Containing Interaction Overheads

• Overlapping computations with interactions
  – Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2)
  – Initiate communication for type 1; during the communication, perform type 2 (see the sketch below)
• Overlapping interactions with interactions
• Replicating data or computations
  – Balancing the extra computation or storage cost against the gain due to less communication
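
A sketch of the overlap pattern with MPI non-blocking calls; the buffer names, ring neighbours, and helper routines are illustrative assumptions, not code from the course.

```c
#include <mpi.h>

/* Placeholders for the two phases of the computation. */
void compute_interior(double *u, int n) { (void)u; (void)n; }                            /* type 2: needs no remote data */
void compute_boundary(double *u, double *halo, int n) { (void)u; (void)halo; (void)n; }  /* type 1: needs the halo       */

void exchange_and_compute(double *u, double *halo, int n,
                          int left, int right, MPI_Comm comm) {
    MPI_Request reqs[2];

    /* Initiate the communication needed by the type-1 computation. */
    MPI_Irecv(halo, n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(u,    n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    /* While the messages are in flight, do the independent (type-2) work. */
    compute_interior(u, n);

    /* Wait for the data, then finish the dependent (type-1) work. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    compute_boundary(u, halo, n);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double u[128] = {0}, halo[128];
    int left  = (rank - 1 + size) % size;   /* ring neighbours for the demo */
    int right = (rank + 1) % size;
    exchange_and_compute(u, halo, 128, left, right, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```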

Page 23: Parallel Programming

Parallel Algorithm Classification – Types – Models

Page 24: Parallel Programming

Parallel Algorithm Types

• Divide and conquer
• Data partitioning / decomposition
• Pipelining

Page 25: Parallel Programming

Divide-and-Conquer

• Recursive in structure
• Divide the problem into sub-problems that are similar to the original but smaller in size
• Conquer the sub-problems by solving them recursively; if small enough, solve them in a straightforward manner
• Combine the solutions to create a solution to the original problem

Page 26: Parallel Programming

Divide-and-Conquer Example: Merge Sort

• Problem: sort a sequence of n elements
• Divide the sequence into two subsequences of n/2 elements each
• Conquer: sort the two subsequences recursively using merge sort
• Combine: merge the two sorted subsequences to produce the sorted answer
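
A plain sequential merge-sort sketch, just to make the divide/conquer/combine steps concrete; in a parallel version the two recursive calls could be handed to different processes, with the merges forming the combine phase.

```c
#include <stdio.h>
#include <string.h>

/* Combine: merge two sorted halves a[lo..mid) and a[mid..hi) via tmp. */
static void merge(int *a, int *tmp, int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof(int));
}

static void merge_sort(int *a, int *tmp, int lo, int hi) {
    if (hi - lo < 2) return;          /* small enough: already sorted */
    int mid = lo + (hi - lo) / 2;     /* divide into two subsequences */
    merge_sort(a, tmp, lo, mid);      /* conquer the left half        */
    merge_sort(a, tmp, mid, hi);      /* conquer the right half       */
    merge(a, tmp, lo, mid, hi);       /* combine the sorted halves    */
}

int main(void) {
    int a[] = {5, 2, 9, 1, 7, 3}, tmp[6];
    merge_sort(a, tmp, 0, 6);
    for (int i = 0; i < 6; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```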

Page 27: Parallel Programming

Partitioning

1. Breaking up the given problem into p independent subproblems of almost equal sizes
2. Solving the p subproblems concurrently

• Mostly splitting the input or output into non-overlapping pieces
• Example: matrix multiplication – either the inputs (A or B) or the output (C) can be partitioned (see the sketch below)
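
A sketch of output (C) partitioning with OpenMP; the matrix size and the static row schedule are my own illustrative choices. Each thread computes a disjoint block of rows of C, so the subproblems are independent.

```c
#include <omp.h>
#include <stdio.h>

#define N 256

static double A[N][N], B[N][N], C[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 1.0; }

    /* Partition the output: each thread owns a contiguous block of rows of C,
       so no two threads ever write the same element. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }

    printf("C[0][0] = %.0f (expected %d)\n", C[0][0], N);
    return 0;
}
```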

Page 28: Parallel Programming

Pipelining

Occurs, for example, in image processing applications where a number of images undergo a sequence of transformations.

Page 29: Parallel Programming

Parallel Algorithm Models
• Data-parallel model – processes perform identical tasks on different data
• Task-parallel model – different processes perform different tasks on the same or different data, based on a task dependency graph
• Work-pool model – any task can be performed by any process; tasks are added to a work pool dynamically
• Pipeline model – a stream of data passes through a chain of processes (stream parallelism)

Page 30: Parallel Programming

Parallel Program Classification – Models – Structure – Paradigms

Page 31: Parallel Programming

Parallel Program Models
• Single Program Multiple Data (SPMD)
• Multiple Program Multiple Data (MPMD)

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
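
A minimal SPMD sketch with MPI: every process runs the same executable and branches on its rank (under MPMD, different ranks would instead run different programs).

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("rank 0: coordinating %d processes\n", size);  /* same program ...   */
    else
        printf("rank %d: doing worker duties\n", rank);       /* ... different roles */

    MPI_Finalize();
    return 0;
}
```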

Page 32: Parallel Programming

Parallel Program Structure Types

• Master-worker / parameter sweep / task farming
• Embarrassingly/pleasingly parallel
• Pipeline / systolic / wavefront
• Tightly coupled
• Workflow

(Figure: example structures – a master P0 coordinating workers P1–P4, and a pipeline of processes P0–P4.)

Page 33: Parallel Programming

Programming Paradigms
• Shared memory model – threads, OpenMP
• Message passing model – MPI
• Data parallel model – HPF

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Page 34: Parallel Programming

Parallel Architectures – Classification – Cache coherence in shared memory platforms – Interconnection networks

Page 35: Parallel Programming

Classification of Architectures – Flynn’s classification
• Single Instruction Single Data (SISD): serial computers
• Single Instruction Multiple Data (SIMD): vector processors and processor arrays
  – Examples: CM-2, Cray-90, Cray YMP, Hitachi 3600

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Page 36: Parallel Programming

Classification of Architectures – Flynn’s classification
• Multiple Instruction Single Data (MISD): not popular
• Multiple Instruction Multiple Data (MIMD): most popular
  – IBM SP and most other supercomputers, clusters, computational Grids, etc.

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Page 37: Parallel Programming

Classification of Architectures – Based on Memory

• Shared memory – two types: UMA and NUMA
• Examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q

(Figure: UMA and NUMA shared-memory organizations.)

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Page 38: Parallel Programming

Classification of Architectures – Based on Memory

• Distributed memory

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

• Recently, multi-cores
• Yet another classification – MPPs, NOW (Berkeley), COW, computational Grids

Page 39: Parallel Programming

Cache Coherence – for details, read Section 2.4.6 of the book

Interconnection Networks – for details, read Sections 2.4.2–2.4.5 of the book

Page 40: Parallel Programming

Cache Coherence in SMPs

(Figure: four CPUs (CPU0–CPU3), each with a private cache holding a copy of cache line ‘a’, connected to a shared main memory.)

• All processes read variable ‘x’ residing in cache line ‘a’
• Each process updates ‘x’ at different points of time
• Challenge: to maintain a consistent view of the data
• Protocols: write-update, write-invalidate

Page 41: Parallel Programming

Cache Coherence Protocols and Implementations
• Write-update – propagate the cache line to the other processors on every write by a processor
• Write-invalidate – each processor gets the updated cache line whenever it reads stale data
• Which is better?

Page 42: Parallel Programming

Caches – False Sharing

(Figure: main memory holding array elements A0–A8 and A9–A15; CPU0’s cache0 holds A0, A2, A4, … while CPU1’s cache1 holds A1, A3, A5, ….)

• Different processors update different parts of the same cache line
• This leads to ping-pong of cache lines between processors
• Remedy: modify the algorithm to change the stride (see the sketch below)
• The situation is better with update protocols than with invalidate protocols. Why?
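
A small OpenMP sketch of false sharing (my own illustration; the 64-byte cache-line size and the iteration count are assumptions). Each thread increments only its own counter, but in the first array the counters share a cache line and the line ping-pongs between caches; padding each counter to a full line avoids the effect.

```c
#include <omp.h>
#include <stdio.h>

#define NTHREADS   4
#define ITERS      100000000L
#define CACHE_LINE 64                        /* assumed cache-line size in bytes */

volatile long counters[NTHREADS];            /* adjacent longs share a cache line */
struct padded { volatile long v; char pad[CACHE_LINE - sizeof(long)]; };
struct padded padded_counters[NTHREADS];     /* roughly one counter per cache line */

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++) counters[id]++;           /* false sharing */
    }
    double t1 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++) padded_counters[id].v++;  /* no sharing */
    }
    double t2 = omp_get_wtime();

    printf("shared line: %.2f s, padded: %.2f s\n", t1 - t0, t2 - t1);
    return 0;
}
```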

Page 43: Parallel Programming

Cache Coherence using Invalidate Protocols
• Three states associated with data items:
  – Shared – a variable shared by two caches
  – Invalid – another processor (say P0) has updated the data item
  – Dirty – the state of the data item in P0
• Implementations:
  – Snoopy – for bus-based architectures; memory operations are propagated over the bus and snooped
  – Directory-based – instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors; a central directory maintains the states of cache blocks and the associated processors, implemented with presence bits

Page 44: Parallel Programming

Interconnection Networks
• An interconnection network is defined by switches, links, and interfaces
  – Switches – provide mapping between input and output ports, buffering, routing, etc.
  – Interfaces – connect nodes to the network
• Network topologies
  – Static – point-to-point communication links among processing nodes
  – Dynamic – communication links are formed dynamically by switches

Page 45: Parallel Programming

Interconnection Networks
• Static
  – Bus – SGI Challenge
  – Completely connected
  – Star
  – Linear array, ring (1-D torus)
  – Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
  – k-d mesh: d dimensions with k nodes in each dimension
  – Hypercube – a k-d mesh with k = 2 and d = log p – e.g. many MIMD machines
  – Trees – our campus network
• Dynamic – communication links are formed dynamically by switches
  – Crossbar – Cray X series – non-blocking network
  – Multistage – SP2 – blocking network
• For more details, and an evaluation of topologies, refer to the book

Page 46: Parallel Programming

Evaluating Interconnection Topologies
• Diameter – the maximum distance between any two processing nodes
  – Fully connected: 1
  – Star: 2
  – Ring: p/2
  – Hypercube: log p
• Connectivity – multiplicity of paths between two nodes; the minimum number of arcs to be removed from the network to break it into two disconnected networks
  – Linear array: 1
  – Ring: 2
  – 2-D mesh: 2
  – 2-D mesh with wraparound: 4
  – d-dimensional hypercube: d

Page 47: Parallel Programming

Evaluating Interconnection Topologies
• Bisection width – the minimum number of links to be removed from the network to partition it into two equal halves
  – Ring: 2
  – P-node 2-D mesh: √P
  – Tree: 1
  – Star: 1
  – Completely connected: P²/4
  – Hypercube: P/2
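
A small sketch that evaluates the diameter and bisection-width expressions above for a 64-node system; it assumes p is a perfect square and a power of two so the mesh and hypercube formulas apply.

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    int p    = 64;
    int side = (int)sqrt((double)p);          /* sqrt(p) x sqrt(p) mesh */
    int logp = (int)round(log2((double)p));   /* hypercube dimension    */

    printf("ring:            diameter = %2d, bisection width = %d\n", p / 2, 2);
    printf("2-D mesh:        diameter = %2d, bisection width = %d\n", 2 * (side - 1), side);
    printf("hypercube:       diameter = %2d, bisection width = %d\n", logp, p / 2);
    printf("fully connected: diameter = %2d, bisection width = %d\n", 1, p * p / 4);
    return 0;
}
```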

Page 48: Parallel Programming

Evaluating Interconnection Topologies
• Channel width – the number of bits that can be communicated simultaneously over a link, i.e. the number of physical wires between two nodes
• Channel rate – the peak rate at which data can be communicated over a single physical wire
• Channel bandwidth – channel rate × channel width
• Bisection bandwidth – the maximum volume of communication possible between the two halves of the network, i.e. bisection width × channel bandwidth
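
A tiny numeric sketch tying the definitions together; the link parameters and node count are invented for illustration.

```c
#include <stdio.h>

int main(void) {
    double channel_rate    = 1.0e9;  /* assumed: 1 Gbit/s per physical wire     */
    int    channel_width   = 16;     /* assumed: 16 wires (bits) per link       */
    int    bisection_width = 32;     /* e.g. p/2 links for a 64-node hypercube  */

    double channel_bw   = channel_rate * channel_width;  /* bits/s per link             */
    double bisection_bw = bisection_width * channel_bw;  /* bits/s across the bisection */

    printf("channel bandwidth   = %.0f Gbit/s\n", channel_bw / 1e9);
    printf("bisection bandwidth = %.0f Gbit/s\n", bisection_bw / 1e9);
    return 0;
}
```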

Page 49: Parallel Programming

END