Introduction to Parallel Programming (Message Passing)
Francisco Almeida
Parallel Computing Group
Beowulf Computers
•Distributed Memory
•COTS: Commercial-Off-The-Shelf computers
The Parallel Model
PRAM
BSP, LogP
PVM, MPI, HPF, Threads, OpenMP
Parallel Architectures
Computational Models
Programming Models
Architectural Models
The Message Passing Model
Interconnection Network processor
processor
processor
processor
processor
processor
processor
Send(parameters)
Recv(parameters)
Network of Workstations (Hardware)
• Sun Sparc Ultra 1, 143 MHz, Etherswitch
• Distributed Memory
• Non-Shared Memory Space
• Star Topology
SGI Origin 2000 (Hardware)
• C4-CEPBA
• 64 R10000 processors
• 8 GB memory
• 32 Gflop/s
• Shared-Distributed Memory
• Hypercubic Topology
Digital AlphaServer 8400 (Hardware)
• C4-CEPBA
• 10 Alpha 21164 processors
• 2 GB Memory
• 8.8 Gflop/s
• Shared Memory
• Bus Topology
Drawbacks that arise when solving Problems using Parallelism
Parallel programming is more complex than sequential programming.
Results may vary as a consequence of intrinsic non-determinism.
New problems arise: deadlocks, starvation...
Parallel programs are more difficult to debug.
Parallel programs are less portable.
MPI
EUI, p4, PVM, Express, Zipcode, CMMD, PARMACS
Parallel Libraries
Parallel Applications
Parallel Languages
MPI
• What Is MPI?
  • Message Passing Interface standard
  • The first standard and portable message-passing library with good performance
  • "Standard" by consensus of MPI Forum participants from over 40 organizations
  • Finished and published in May 1994, updated in June 1995
• What does MPI offer?
  • Standardization - on many levels
  • Portability - to existing and new systems
  • Performance - comparable to vendors' proprietary libraries
  • Richness - extensive functionality, many quality implementations
MPI hello.c

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
  int name, p, source, dest, tag = 0;
  char message[100];
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &name);
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  if (name != 0) {
    printf("Processor %d of %d\n", name, p);
    sprintf(message, "greetings from process %d!", name);
    dest = 0;
    MPI_Send(message, strlen(message) + 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
  } else {
    printf("processor 0, p = %d ", p);
    for (source = 1; source < p; source++) {
      MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
      printf("%s\n", message);
    }
  }
  MPI_Finalize();
  return 0;
}
Processor 2 of 4
Processor 3 of 4
Processor 1 of 4
processor 0, p = 4 greetings from process 1!
greetings from process 2!
greetings from process 3!
A Simple MPI Program
mpicc -o hello hello.c
mpirun -np 4 hello
Basic Communication Operations
One-to-all broadcast Single-node Accumulation
[Figure: message M is broadcast from node 0 to nodes 1..p in steps 1, 2, ..., p; the dual single-node accumulation gathers and combines the messages M from all nodes into node 0.]
Broadcast on Hypercubes
[Figure: the three steps of a broadcast on a 3-dimensional hypercube with nodes 0..7; at each step, every node holding the message forwards it across one new dimension.]
MPI Broadcast
int MPI_Bcast(
    void *buffer,
    int count,
    MPI_Datatype datatype,
    int root,
    MPI_Comm comm
);

Broadcasts a message from the process with rank "root" to all other processes of the group.
Reduction on Hypercubes
@ is a commutative and associative operator.
Ai resides in processor i.
Every processor has to obtain A0 @ A1 @ ... @ A(P-1).
[Figure: reduction on a 3-dimensional hypercube with nodes 000..111. Starting from Ai in node i, each step combines partial results across one dimension (e.g. node 000 holds A0@A1 and node 001 holds A1@A0 after the first step), until every node holds A0@A1@...@A7.]
Reductions with MPI
int MPI_Reduce(
    void *sendbuf,
    void *recvbuf,
    int count,
    MPI_Datatype datatype,
    MPI_Op op,
    int root,
    MPI_Comm comm
);

Reduces values on all processes to a single value.

int MPI_Allreduce(
    void *sendbuf,
    void *recvbuf,
    int count,
    MPI_Datatype datatype,
    MPI_Op op,
    MPI_Comm comm
);

Combines values from all processes and distributes the result back to all processes.
All-To-All Broadcast, Multinode Accumulation

[Figure: in an all-to-all broadcast, every node i starts with its own message Mi and ends holding M0, M1, ..., Mp; the dual multinode accumulation combines the messages at every node. These operations underlie reductions and prefix sums.]
MPI Collective Operations
MPI Operator Operation
---------------------------------------------------------------
MPI_MAX maximum
MPI_MIN minimum
MPI_SUM sum
MPI_PROD product
MPI_LAND logical and
MPI_BAND bitwise and
MPI_LOR logical or
MPI_BOR bitwise or
MPI_LXOR logical exclusive or
MPI_BXOR bitwise exclusive or
MPI_MAXLOC max value and location
MPI_MINLOC min value and location
The Master Slave Paradigm
Master
Slaves
Computing π

π = ∫[0,1] 4/(1 + x²) dx

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
h = 1.0 / (double) n;
sum = 0.0;
for (i = myid + 1; i <= n; i += numprocs) {
    x = h * ((double)i - 0.5);
    sum += f(x);
}
mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
mpirun -np 3 cpi
The Portability of the Efficiency
The Sequential Algorithm
void mochila01_sec (void)
{
  unsigned v1;
  int c, k;
  for (c = 0; c <= C; c++)
    f[0][c] = 0;
  for (k = 1; k <= N; k++)
    for (c = 0; c <= C; c++) {
      f[k][c] = f[k-1][c];
      if (c >= w[k]) {
        v1 = f[k-1][c - w[k]] + p[k];
        if (v1 > f[k][c])
          f[k][c] = v1;
      }
    }
}
f[k][c] = max { f[k-1][c], f[k-1][c - w[k]] + p[k] (for c ≥ w[k]) }
[Figure: the n × C table f; row f[k] is computed from row f[k-1].]

The sequential running time is O(n·C).
The Parallel Algorithm
void transition (int stage)
{
  unsigned x;
  int c, k;
  k = stage;
  for (c = 0; c <= C; c++)
    f[c] = 0;
  for (c = 0; c <= C; c++) {
    IN(&x);
    f[c] = max(f[c], x);
    OUT(&f[c], 1, sizeof(unsigned));
    if (C >= c + w[k])
      f[c + w[k]] = x + p[k];
  }
}
[Figure: processor k - 1 feeds values f[k-1][c] into processor k, which produces f[k][c] for each column c.]

f[k][c] = max { f[k-1][c], f[k-1][c - w[k]] + p[k] }
The Evolution of the Pipeline

[Figure: the pipeline of n stages sweeps the C columns diagonally.]

The Running Time

With n processors, the pipeline completes in n - 1 + C steps.
Processor Virtualization: Block Mapping

[Figure: the n stages are split into blocks of n/p consecutive stages, one block per processor 0, 1, 2, ...; each block sweeps the C columns.]

The Running Time

Each processor must wait (n/p - 1)·C steps for its predecessor, so the startup chain costs (p-1)(n/p - 1)C, and the last block adds nC/p:

(p-1)(n/p - 1)C + nC/p ≈ nC
Processor Virtualization

[Figure: each processor still owns n/p stages, but forwards its first results after only n/p steps, so successive processors start much earlier.]

The Running Time

(p-1)(n/p) + nC/p ≈ nC/p
Block Mapping
void transition (void)
{
unsigned c, k, i, inData;
for (c = 0; c <= C; c++){
IN(&inData);
k = calcInitStage();
for (i = 0; i < width; k++, i++) {
f[i] [c] = max(f[i][c], inData);
if (c + w[k] <= C)
f[i][c + w[k]] = inData + p[k];
inData = f[i][c];
}
OUT(&f[i-1][c], 1, sizeof(unsigned));
}
}
width = N / num_proc;
if (f_name < N % num_proc)   /* Load Balancing */
  width++;
int calcInitStage( void )
{
return (f_name < N % num_proc) ?
f_name * width :
(f_name * width) + (N % num_proc) ;
}
Cyclic Mapping

[Figure: stages are dealt cyclically to processors 0, 1, 2, ...; a queue ("cola") holds the stages waiting for the next round.]

The Running Time

(p-1) + (n/p)·C

Cyclic Mapping
void transition (int stage)
{
unsigned x;
int c, k;
k = stage;
for (c = 0; c <= C; c++)
f[c] = 0;
for (c = 0; c <= C; c++) {
IN(&x);
f[c] = max(f[c], x);
OUT(&f[c], 1, sizeof(unsigned));
if (C >= c + w[k])
f[c + w[k]] = x + p[k];
}
}
int bands = num_bands(n);
for (i = 0; i < bands; i++) {
stage = f_name + i * num_proc;
if (stage <= n - 1)
transition (stage);
}
unsigned num_bands (unsigned n)
{
float aux_f;
unsigned aux;
aux_f = (float) n / (float) num_proc;
aux = (unsigned) aux_f;
if (aux_f > aux)
return (aux + 1);
return (aux);
}
Advantages and Disadvantages
Block Distribution:
– Minimizes the Number of Communications
– Penalizes the Startup Time of the Pipeline

Cyclic Distribution:
– Minimizes the Startup Time of the Pipeline
– May Produce Communications Overhead
Transputer Network - Local Area Network

Local Area Network:
– Coarse Grain
– Serial Communications

Transputer Network:
– Fine Grain
– Parallel Communications
Computational Results
[Figure: running time versus number of processors (1-32) for problem sizes 4x8 through 16x128, on the transputer network and on the local area network.]
The Resource Allocation Problem
M units of an indivisible resource and a set of N tasks. fj(x) is the benefit obtained when x units of resource are allocated to task j.

max Σ(j=1..N) fj(xj)

subject to:
Σ(j=1..N) xj ≤ M
0 ≤ xj ≤ Bj, xj integer, j = 1, ..., N
RAP- The Sequential Algorithm
G[k][m] = max { G[k-1][m-i] + fk(i) : 0 ≤ i ≤ m }
int rap_seq(void) {
  int i, k, m;
  for (m = 0; m <= M; m++)
    G[0][m] = 0;
  for (k = 1; k <= N; k++)
    for (m = 0; m <= M; m++) {
      G[k][m] = 0;
      for (i = 0; i <= m; i++)
        G[k][m] = max(G[k][m], G[k-1][i] + f(k, m - i));
    }
  return G[N][M];
}
[Figure: the N × M table G; column k is computed from column k-1.]

The sequential running time is O(N·M²).
RAP - The Parallel Algorithm
void transition (int stage)
{
  int m, j, x, k;
  for (m = 0; m <= M; m++)
    G[m] = 0;
  k = stage;
  for (m = 0; m <= M; m++) {
    IN(&x);
    G[m] = max(G[m], x + f(k-1, 0));
    OUT(&G[m], 1, sizeof(int));
    for (j = m + 1; j <= M; j++)
      G[j] = max(G[j], x + f(k - 1, j - m));
  } /* for m ... */
} /* transition */
[Figure: processor k - 1 feeds G[k-1][m] into processor k, which produces G[k][m] for each m.]

G[k][m] = max { G[k-1][m-i] + fk(i) : 0 ≤ i ≤ m }
The Cray T3E
CRAY T3E:
– Shared Address Space
– Three-Dimensional Toroidal Network
Block-Cyclic Mapping

[Figure: bands of g stages are dealt cyclically to processors 0, 1, 2, ..., with a queue of pending bands.]

Running time: g(p-1) + gM²·(n/(gp))
Computational Results
[Figure: Cray T3E running time and speedup, versus grain (1-40) for 2, 4, 8 and 16 processors, and versus processors (2-16) for problem sizes 10x100, 100x1000, 400x1000 and 1000x1000.]
Linear Model to Predict Communication Performance
Time to send n bytes = β + τ·n

[Figure: measured communication time (seconds, log scale) versus message size for BEOULL and the CRAY T3E, with fitted linear models 5E-08·n + 5E-05 and 7E-07·n + 0.0003.]
PAPI
http://icl.cs.utk.edu/projects/papi/
PAPI aims to provide the tool designer and application engineer with a consistent interface and methodology for use of the performance counter hardware found in most major microprocessors.
Buffering Data
A virtual process name runs on the real processor fname if (name / grain) mod p == fname.

For example, with p = 2 and grain = 3, virtual processes 0 1 2 | 3 4 5 | 6 7 8 run on processors 0, 1, 0.
[Figure: outgoing IN/OUT traffic is buffered; SET_BUFIO(1, size) groups outgoing items into packets of size B.]
The Knapsack Problem, N = 12800, M = 12800. Cray T3E

[Figure: running time (sec) versus grain and buffer size for problem p128-128.knp with np = 2, 4, 8 and 16.]
The Resource Allocation Problem. Cray T3E

[Figure: running time (sec) versus grain and buffer size for problem 1000x1000 with np = 2, 4, 8 and 16.]
Portability of the Efficiency
One disappointing contrast in parallel systems is between the peak performance of the parallel systems and the actual performance of parallel applications.
Metrics, techniques and tools have been developed to understand the sources of performance degradation.
An effective parallel program development cycle may iterate many times before achieving the desired performance.
Performance prediction is important in achieving efficient execution of parallel programs, since it allows one to avoid the coding and debugging cost of inefficient strategies.
Most of the approaches to performance analysis fall into two categories: Analytical Modeling and Performance Profiling.
Performance Analysis
Profiling may be conducted on an existing parallel system to recognize current performance bottlenecks, correct them, and identify and prevent potential future performance problems.
Architecture dependent. The majority of performance metrics and tools devised reflect their orientation towards the measurement-modify paradigm.

Tools: PICL, Dimemas, Kpi; ParaGraph, Vampir, Paraver.

[Diagram: Instrumentation → Computation → Profile analysis → New Tuning Parameters, producing Error Prediction and Run Time Prediction.]
Performance Analysis
Analytical Modeling:
– Provides a structured way for understanding performance problems
– Architecture Independent
– Has predictive ability
– Modeling is not a trivial task. The model must be simple enough to be tractable, and sufficiently detailed to be accurate.
– PRAM, LogP, BSP, BSPWB, etc.

[Diagram: Computation → Analytical Modeling → Optimal Parameter Prediction, Error Prediction and Run Time Prediction.]
Standard Loop on a Pipeline Algorithm
void f() {
  Compute(body0);
  while (running) {
    Receive();
    Compute(body1);
    Send();
    Compute(body2);
  }
}

body0 takes constant time.
body1 and body2 depend on the iteration of the loop.
Analytical Model
Numerical Solutions for every case
The Analytical Model
• Ts denotes the startup time between two processors:

Ts = t0·(G - 1) + G·Σ(i=1..B-1) (t1i + t2i) + 2·I·(G - 1)·B + E·B + β + τ·B

• Tc denotes the whole evaluation of G processes, including the time to send M/B packets of size B:

Tc = t0·G + G·Σ(i=1..M) (t1i + t2i) + 2·I·(G - 1)·M + E·M + (β + τ·B)·(M/B)

[Figure: bands of G stages with buffers of size B flow through processors 0, 1, 2; M/B packets per band.]
The Analytical Model
T1(G, B) = Ts·(p - 1) + Tc·N/(G·p),    for 1 ≤ G ≤ N/p and 1 ≤ B ≤ M

[Figure: processors 0, 1, ..., p-1 each process bands of G stages, exchanging buffers of size B.]

T2(G, B) = Ts·(N/G - 1) + Tc

R1 = values (G, B) where Ts·p ≥ Tc
R2 = values (G, B) where Ts·p < Tc
Validation of The Model
[Figure: Knapsack Problem, Model vs Best Real running time for 2, 4, 8 and 16 processors.]

[Figure: RAP Problem, Model vs Best Real running time for 2, 4, 8 and 16 processors.]
The Tuning Problem
Given an algorithm A, FA is the input/output function computed by the algorithm:

FA : D = D1 × ... × Dn → Σ*

FA(z) is the output value of the algorithm A for the input z belonging to D.

TimeM(A(z)) is the execution time of the algorithm A over the input z on a machine M. CTimeM(A(z)) is the analytical complexity-time formula that approximates TimeM(A(z)).

T = D1 × ... × Dk are the Tuning Parameters and I = Dk+1 × ... × Dn are the Input Parameters. x ∈ T if and only if x has impact only on the performance of the algorithm, not on its output:

FA(x, z) = FA(y, z) for any x and y ∈ T, while in general TimeM(A(x, z)) ≠ TimeM(A(y, z)).

The Tuning Problem is to find x0 ∈ T such that

CTimeM(A(x0, z)) = min { CTimeM(A(x, z)) : x ∈ T }
Tuning Parameters

The list of tuning parameters in parallel computing is extensive:
– The most obvious tuning parameter is the Number of Processors.
– The size of the buffers used during data exchange.
– Under the Master-Slave paradigm, the size and the number of data items generated by the master.
– In the parallel Divide and Conquer technique, the size of a subproblem to be considered trivial and the processor assignment policy.
– On regular numerical HPF-like algorithms, the block size allocation.
The Methodology
Profile the execution to compute the parameters needed for the complexity-time function CTimeM(A(x, z)).

Compute x0 ∈ T that minimizes the complexity-time function:

CTimeM(A(x0, z)) = min { CTimeM(A(x, z)) : x ∈ T }

At this point, the predictive ability of the complexity-time function can be used to predict the execution time TimeM(A(z)) of an optimal execution, or to execute the algorithm according to the optimal tuning parameters x0.

[Diagram: Instrumentation → Analytical Modeling → Optimal Parameter Computation → Run Time Prediction and Error Prediction Computation.]
llp Solver

[Diagram: instrumentation on llp communication calls (gettime() around IN and OUT) computes t0, t1 and t2; llp analytical modeling then computes Min(T(p, G, B)) and delivers the Run Time Prediction and Error Prediction.]
The MALLBA Infrastructure

[Figure: the three MALLBA nodes: BA (Barcelona), MA (Málaga), LL (La Laguna).]
Performance Prediction: BA - ULL

[Figure: communication time versus message size between the BA and ULL nodes, with fitted linear models 0.0001·n - 0.0151 (BAULL-1) and 9E-05·n + 0.005 (BAULL-2), compared on a log scale with BEOULL and CRAYT3E for messages up to 1E+06 bytes.]
The MALLBA Project
Library for the resolution of combinatorial optimisation problems.
– 3 types of resolution techniques:
  • Exact
  • Heuristic
  • Hybrid
– 3 implementations:
  • Sequential
  • LAN
  • WAN

Goals:
– Genericity
– Ease of utilization
– Locally- and geographically-distributed computation
References
Wilkinson B., Allen M. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. 1999. Prentice-Hall.
Gropp W., Lusk E., Skjellum A. Using MPI. Portable Parallel Programming with the Message-Passing Interface. 1999. The MIT Press.
Pacheco P. Parallel Programming with MPI. 1997. Morgan Kaufmann Publishers.
Wu X. Performance Evaluation, Prediction and Visualization of Parallel Systems.
nereida.deioc.ull.es