Parallel Programming
Sathish S. Vadhiyar
Course Web Page: http://www.serc.iisc.ernet.in/~vss/courses/PPP2009
Motivation for Parallel Programming
• Faster execution time, due to non-dependencies between regions of code
• Presents a level of modularity
• Resource constraints, e.g. large databases
• Certain classes of algorithms lend themselves naturally to parallel formulation
• Aggregate bandwidth to memory/disk; increase in data throughput
• Clock rate improvement in the past decade – 40%
• Memory access time improvement in the past decade – 10%
• Grand challenge problems (more later)
Challenges / Problems in Parallel Algorithms
Building efficient algorithms means avoiding:
• Communication delay
• Idling
• Synchronization
Challenges
[Figure: timeline for two processors P0 and P1, showing periods of computation, communication, synchronization, and idle time]
Metrics – How do we evaluate a parallel program?
• Execution time, Tp
• Speedup, S
  - S(p, n) = T(1, n) / T(p, n)
  - Usually S(p, n) < p; sometimes S(p, n) > p (superlinear speedup)
• Efficiency, E
  - E(p, n) = S(p, n) / p
  - Usually E(p, n) < 1; sometimes greater than 1
• Scalability – limitations in parallel computing, and their relation to n and p
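For example (illustrative numbers): if a program takes T(1, n) = 100 s on one processor and T(8, n) = 16 s on 8 processors, then S(8, n) = 100/16 = 6.25 and E(8, n) = 6.25/8 ≈ 0.78.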
Speedups and efficiency
[Figures: speedup S vs. number of processors p (ideal linear vs. practical), and efficiency E vs. p]
Limitations on speedup – Amdahl’s law
• Amdahl’s law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
• Overall speedup is expressed in terms of the fractions of computation time with and without the enhancement, and the speedup of the enhanced portion.
• Places a limit on the speedup due to parallelism.
• Speedup = 1 / (fs + fp/P), where fs and fp are the serial and parallel fractions (fs + fp = 1) and P is the number of processors.
Amdahl’s law Illustration
Courtesy:
http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html
http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm
[Plot: efficiency (0 to 1) as a function of the number of processors (0 to 15), illustrating Amdahl’s law]
S = 1 / (s + (1-s)/p), where s is the serial fraction of the computation
Amdahl’s law analysis
f P=1 P=4 P=8 P=16 P=32
1.00 1.0 4.00 8.00 16.00 32.00
0.99 1.0 3.88 7.48 13.91 24.43
0.98 1.0 3.77 7.02 12.31 19.75
0.96 1.0 3.57 6.25 10.00 14.29
• For the same fraction, the speedup numbers fall further and further behind the processor count as P grows.
• Thus Amdahl’s law is a bit depressing for parallel programming.
• In practice, the parallel portion of the work has to be a large enough fraction to match a given number of processors.
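The table follows directly from the formula; a small C sketch (illustrative only, not course material) that reproduces it:

#include <stdio.h>

/* Amdahl's law: speedup for parallel fraction f on P processors. */
static double amdahl(double f, int P) {
    return 1.0 / ((1.0 - f) + f / P);
}

int main(void) {
    const double f[] = { 1.00, 0.99, 0.98, 0.96 };
    const int    P[] = { 1, 4, 8, 16, 32 };
    for (int i = 0; i < 4; i++) {
        printf("f=%.2f:", f[i]);
        for (int j = 0; j < 5; j++)
            printf(" %6.2f", amdahl(f[i], P[j]));
        printf("\n");
    }
    return 0;
}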
Gustafson’s Law
• Amdahl’s law – keep the parallel work fixed.
• Gustafson’s law – keep the computation time on the parallel processors fixed, and change the problem size (the ratio of parallel to sequential work) to match that computation time.
• For a particular number of processors, find the problem size for which the parallel time equals the constant time.
• For that problem size, find the sequential time and the corresponding speedup.
• This speedup is therefore called scaled speedup.
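In its commonly quoted form, with s the serial fraction of the time spent on the parallel machine, the scaled speedup is S(P) = s + (1 - s)·P = P - s·(P - 1). For example (illustrative numbers), with s = 0.05 and P = 32, S ≈ 0.05 + 0.95 × 32 = 30.45, far more optimistic than the roughly 12.5 that Amdahl’s law gives for the same fractions.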
Metrics (Contd..)
N P=1 P=4 P=8 P=16 P=32
64 1.0 0.80 0.57 0.33
192 1.0 0.92 0.80 0.60
512 1.0 0.97 0.91 0.80
Table 5.1: Efficiency as a function of n and p.
Scalability
• Efficiency decreases with increasing P; increases with increasing N.
• Scalability: how effectively the parallel algorithm can use an increasing number of processors.
• How the amount of computation performed must scale with P to keep E constant.
• This function of computation in terms of P is called the isoefficiency function.
• An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable.
Scalability Analysis – Finite Difference algorithm with 1D decomposition
• For constant efficiency, a function of P, when substituted for N, must keep E constant as P increases.
• Hence the isoefficiency function = O(P^2), since computation is O(N^2).
• Can be satisfied with N = P, except for small P.
Scalability Analysis – Finite Difference algorithm with 2D decomposition
• Hence the isoefficiency function = O(P).
• Can be satisfied with N = sqrt(P).
• The 2D algorithm is therefore more scalable than the 1D one.
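A rough way to see both results (a sketch assuming an N × N grid, nearest-neighbour updates, and ignoring constant factors): with a 1D decomposition each of the P processors computes N²/P points per iteration but exchanges whole rows of N points, so E ≈ N² / (N² + P·N); keeping E constant forces N to grow in proportion to P, so the total computation N² grows as P². With a 2D decomposition each processor exchanges only O(N/sqrt(P)) points, so E ≈ N² / (N² + N·sqrt(P)); N need only grow as sqrt(P), and the total computation grows as P.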
Parallel Algorithm Design
Steps
• Decomposition – splitting the problem into tasks or modules
• Mapping – assigning tasks to processors
Mapping has contradictory objectives:
• To minimize idle times
• To reduce communication
Mapping
Static mapping
• Mapping based on data partitioning
  - Applicable to dense matrix computations
  - Block distribution
  - Block-cyclic distribution
• Graph-partitioning-based mapping
  - Applicable to sparse matrix computations
• Mapping based on task partitioning
Example distributions of 9 elements over processors 0, 1, 2:
  Block:  0 0 0 1 1 1 2 2 2
  Cyclic: 0 1 2 0 1 2 0 1 2
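A minimal C sketch (illustrative helper functions, not from the course) of how an array index maps to a processor under these distributions:

/* Which of the p processors owns array element i? */

/* Block distribution: contiguous chunks of ceil(n/p) elements. */
int block_owner(int i, int n, int p) {
    int chunk = (n + p - 1) / p;
    return i / chunk;
}

/* Block-cyclic distribution: blocks of b elements dealt out round-robin;
   b = 1 gives the plain cyclic distribution shown above. */
int block_cyclic_owner(int i, int b, int p) {
    return (i / b) % p;
}

For n = 9 and p = 3 these reproduce the rows above: block_owner gives 0 0 0 1 1 1 2 2 2, and block_cyclic_owner with b = 1 gives 0 1 2 0 1 2 0 1 2.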
Based on Task Partitioning
• Based on the task dependency graph
• In general the mapping problem is NP-complete
[Figure: a binary task dependency tree mapped onto 8 processors – the root on processor 0; the next level on 0 and 4; then 0, 2, 4, 6; and the leaves on processors 0–7]
Mapping
Dynamic Mapping
• A process / global memory can hold a set of tasks
• Distribute some tasks to all processes
• Once a process completes its tasks, it asks the coordinator process for more tasks
• Referred to as self-scheduling or work-stealing
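On shared memory the same idea is available directly from OpenMP’s dynamic loop schedule; a minimal sketch (process_task is a hypothetical placeholder for the per-task work):

#include <omp.h>

void process_task(int t);   /* hypothetical per-task work, defined elsewhere */

void run_tasks(int ntasks) {
    /* Each thread grabs 4 tasks at a time from a shared pool and,
       when it finishes them, asks the runtime for more (self-scheduling). */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int t = 0; t < ntasks; t++)
        process_task(t);
}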
Interaction Overheads
• In spite of the best efforts in mapping, there can be interaction overheads
• Due to frequent communications, exchanging large volumes of data, interaction with the farthest processors, etc.
• Some techniques can be used to minimize interactions
Parallel Algorithm Design – Containing Interaction Overheads
• Maximizing data locality
• Minimizing the volume of data exchange
  - Using higher-dimensional mappings
  - Not communicating intermediate results
• Minimizing the frequency of interactions
• Minimizing contention and hot spots
  - Avoid having every process use the same communication pattern with the other processes at the same time
Parallel Algorithm Design – Containing Interaction Overheads
• Overlapping computations with interactions
  - Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2)
  - Initiate the communication for type 1; during the communication, perform the type 2 computations (see the sketch after this list)
• Overlapping interactions with interactions
• Replicating data or computations
  - Balancing the extra computation or storage cost against the gain from reduced communication
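A minimal MPI sketch of this overlap (the buffer layout and the compute_* function pointers are illustrative assumptions, not course code):

#include <mpi.h>

/* halo holds 2*n doubles: the first n are received from the neighbour nbr,
   the next n are the values sent to it. */
void exchange_and_compute(double *halo, int n, int nbr,
                          void (*compute_type2)(void),   /* independent work */
                          void (*compute_type1)(void))   /* needs halo data  */
{
    MPI_Request req[2];

    /* Initiate the communication needed by the type-1 computations */
    MPI_Irecv(halo,     n, MPI_DOUBLE, nbr, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(halo + n, n, MPI_DOUBLE, nbr, 0, MPI_COMM_WORLD, &req[1]);

    /* While the messages are in flight, perform the type-2 computations */
    compute_type2();

    /* Wait for the data, then perform the type-1 computations */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    compute_type1();
}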
Parallel Algorithm Classification – Types – Models
Parallel Algorithm Types
• Divide and conquer
• Data partitioning / decomposition
• Pipelining
Divide-and-Conquer
• Recursive in structure
• Divide the problem into sub-problems that are similar to the original but smaller in size
• Conquer the sub-problems by solving them recursively; if small enough, solve them in a straightforward manner
• Combine the solutions to create a solution to the original problem
Divide-and-Conquer Example: Merge Sort
• Problem: sort a sequence of n elements
• Divide the sequence into two subsequences of n/2 elements each
• Conquer: sort the two subsequences recursively using merge sort
• Combine: merge the two sorted subsequences to produce the sorted answer
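A compact serial C sketch of the three steps (illustrative; a parallel variant would hand the two recursive calls to different processors or threads):

#include <string.h>

/* Sort a[lo..hi) with the help of a scratch buffer tmp of the same size. */
void merge_sort(int *a, int *tmp, int lo, int hi) {
    if (hi - lo < 2) return;                      /* small enough: done */
    int mid = lo + (hi - lo) / 2;
    merge_sort(a, tmp, lo, mid);                  /* conquer left half  */
    merge_sort(a, tmp, mid, hi);                  /* conquer right half */
    int i = lo, j = mid, k = lo;                  /* combine: merge     */
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof(int));
}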
Partitioning
1. Breaking up the given problem into p independent subproblems of almost equal sizes
2. Solving the p subproblems concurrently
• Mostly splitting the input or output into non-overlapping pieces
• Example: matrix multiplication – either the inputs (A or B) or the output (C) can be partitioned.
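An illustrative C sketch of output partitioning (hypothetical function, row-major square matrices assumed): worker rank out of p computes a contiguous block of rows of C = A × B.

/* Worker `rank` (0..p-1) computes rows [lo, hi) of C = A * B (n x n). */
void matmul_rows(const double *A, const double *B, double *C,
                 int n, int rank, int p) {
    int chunk = (n + p - 1) / p;
    int lo = rank * chunk;
    int hi = (lo + chunk < n) ? lo + chunk : n;
    for (int i = lo; i < hi; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}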
Pipelining
Occurs with image processing applications, where a number of images undergo a sequence of transformations.
Parallel Algorithm Models
• Data parallel model – processes perform identical tasks on different data
• Task parallel model – different processes perform different tasks on the same or different data, based on a task dependency graph
• Work pool model – any task can be performed by any process; tasks are added to a work pool dynamically
• Pipeline model – a stream of data passes through a chain of processes (stream parallelism)
Parallel Program Classification – Models – Structure – Paradigms
Parallel Program Models
• Single Program Multiple Data (SPMD)
• Multiple Program Multiple Data (MPMD)
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Parallel Program Structure Types
• Master-Worker / parameter sweep / task farming
• Embarrassingly / pleasingly parallel
• Pipeline / systolic / wavefront
• Tightly-coupled
• Workflow
[Figures: a master process P0 coordinating workers P1–P4, and a pipeline of processes P0 through P4]
Programming Paradigms
• Shared memory model – Threads, OpenMP
• Message passing model – MPI
• Data parallel model – HPF
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Parallel Architectures – Classification – Cache coherence in shared memory platforms – Interconnection networks
Classification of Architectures – Flynn’s classification
• Single Instruction Single Data (SISD): serial computers
• Single Instruction Multiple Data (SIMD)
  - Vector processors and processor arrays
  - Examples: CM-2, Cray C90, Cray Y-MP, Hitachi 3600
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Flynn’s classification
• Multiple Instruction Single Data (MISD): not popular
• Multiple Instruction Multiple Data (MIMD)
  - Most popular – IBM SP and most other supercomputers, clusters, computational Grids, etc.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Based on Memory
• Shared memory – 2 types: UMA and NUMA
[Diagrams: UMA and NUMA shared-memory organizations]
Examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Based on Memory
• Distributed memory
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
• Recently, multi-cores
• Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids
Cache Coherence – for details, read Section 2.4.6 of the book
Interconnection networks – for details, read Sections 2.4.2–2.4.5 of the book
Cache Coherence in SMPs
[Figure: main memory and four CPUs (CPU0–CPU3), each with its own cache (cache0–cache3) holding a copy of cache line ‘a’]
• All processors read a variable ‘x’ residing in cache line ‘a’
• Each processor then updates ‘x’ at different points in time
• Challenge: to maintain a consistent view of the data
• Protocols:
  - Write update
  - Write invalidate
Cache Coherence Protocols and Implementations
• Write update – propagate the cache line to the other processors on every write by a processor
• Write invalidate – each processor gets the updated cache line whenever it reads stale data
• Which is better??
Caches – False Sharing
[Figure: main memory holding cache lines A0–A8 and A9–A15; CPU0 (cache0) updates A0, A2, A4, … while CPU1 (cache1) updates A1, A3, A5, …, so both CPUs touch the same cache lines]
• Different processors update different parts of the same cache line
• Leads to ping-pong of cache lines between processors
• Remedy: modify the algorithm to change the stride
• The situation is better with update protocols than with invalidate protocols. Why?
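A minimal OpenMP sketch of the usual fix (illustrative only): give each thread its own padded accumulator so that no two threads write into the same cache line.

#include <omp.h>

#define NTHREADS 4
#define PAD 8   /* 8 doubles = 64 bytes, one cache line assumed */

/* partial[t][0] is thread t's accumulator; the padding keeps each
   accumulator on its own cache line.  With a plain double partial[NTHREADS],
   the accumulators would share a line and every update would invalidate
   the other threads' copies (false sharing). */
static double partial[NTHREADS][PAD];

double sum(const double *a, int n) {
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        partial[t][0] = 0.0;
        #pragma omp for
        for (int i = 0; i < n; i++)
            partial[t][0] += a[i];
    }
    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) total += partial[t][0];
    return total;
}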
Cache Coherence using invalidate protocols
• 3 states associated with data items:
  - Shared – a variable shared by 2 caches
  - Invalid – another processor (say P0) has updated the data item
  - Dirty – state of the data item in P0
• Implementations
  - Snoopy: for bus-based architectures; memory operations are propagated over the bus and snooped by all caches
  - Directory-based: instead of broadcasting memory operations to all processors, coherence operations are propagated only to the relevant processors; a central directory maintains the states of cache blocks and the associated processors, implemented with presence bits
Interconnection Networks
• An interconnection network is defined by switches, links and interfaces
  - Switches – provide mapping between input and output ports, buffering, routing, etc.
  - Interfaces – connect nodes with the network
• Network topologies
  - Static – point-to-point communication links among processing nodes
  - Dynamic – communication links are formed dynamically by switches
Interconnection Networks
Static:
• Bus – SGI Challenge
• Completely connected
• Star
• Linear array, Ring (1-D torus)
• Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
• k-d mesh: d dimensions with k nodes in each dimension
• Hypercubes – a log p-dimensional mesh with 2 nodes in each dimension – e.g. many MIMD machines
• Trees – our campus network
Dynamic – communication links are formed dynamically by switches:
• Crossbar – Cray X series – non-blocking network
• Multistage – SP2 – blocking network
For more details, and an evaluation of topologies, refer to the book.
Evaluating Interconnection Topologies
• Diameter – maximum distance between any two processing nodes
  - Fully-connected – 1
  - Star – 2
  - Ring – p/2
  - Hypercube – log P
• Connectivity – multiplicity of paths between 2 nodes; the minimum number of arcs to be removed from the network to break it into two disconnected networks
  - Linear array – 1
  - Ring – 2
  - 2-D mesh – 2
  - 2-D mesh with wraparound – 4
  - d-dimensional hypercube – d
Evaluating Interconnection Topologies
• Bisection width – minimum number of links to be removed from the network to partition it into 2 equal halves
  - Ring – 2
  - P-node 2-D mesh – sqrt(P)
  - Tree – 1
  - Star – 1
  - Completely connected – P^2/4
  - Hypercube – P/2
Evaluating Interconnection Topologies
• Channel width – number of bits that can be communicated simultaneously over a link, i.e. the number of physical wires between 2 nodes
• Channel rate – performance of a single physical wire
• Channel bandwidth – channel rate times channel width
• Bisection bandwidth – minimum volume of communication allowed between any two halves of the network, i.e. bisection width times channel bandwidth
END