Mapping Parallel Programs into Hierarchical Distributed Computer Systems
Mapping Parallel Programs into Hierarchical
Distributed Computer Systems
Prof. Victor G. Khoroshevsky and Mikhail G. Kurnosov
Computer Systems Laboratory,
The A.V. Rzhanov Institute of Semiconductor Physics of Siberian Branch of
Russian Academy of Sciences,
13 Lavrentyev ave., 630090 Novosibirsk, Russia
E-mail: [email protected]
4th International Conference on Software and Data Technologies (ICSOFT 2009)
Sofia, Bulgaria, 26 - 29 July, 2009
Mapping High-Performance Linpack into a hierarchical computer cluster:
Mapping by standard MPI tools (mpiexec) – execution time 118 sec. (44 GFLOPS)
Optimized mapping – execution time 100 sec. (53 GFLOPS)
Mapping Parallel Programs into
Hierarchical Distributed Computer Systems
High-Performance Linpack task graph (NP=8, PMAP=1, BCAST=5)
Computer cluster with hierarchical organization
Two SMP-nodes: 2 x Intel Xeon 5150
Related Work
3ICSOFT 2009, July 26 – 29, 2009, Sofia, Bulgaria Mikhail Kurnosov
1. Mapping parallel programs into computer systems (CS) with a fixed network topology (hypercube, 3D-torus, mesh, etc.). The parallel program is represented by a task graph:
(Yu, 2006), (Chen et al., 2006), (Bhanot et al., 2005), (Jose, 1999), (Ahmad, 1997), (Kalinowski, 1994), (Yau, 1993), (Ercal et al., 1990), (Lee, 1989), (Bokhari, 1981).
2. Mapping parallel programs into CSs with an arbitrary topology. The parallel program is represented by an unweighted task graph:
(Ucar et al., 2006), (Prakash et al., 2004), (Miquel et al., 2003), (Träff, 2002), (Moh, 2001), (Perego, 1998), (Lee, 1989).
Algorithms that take into account the hierarchical organization of modern distributed computer systems are needed.
The objective of our research is the development of models and algorithms for mapping parallel programs into modern hierarchical computer systems (such as multicore computer clusters).
Model of Hierarchical Organization of Distributed Computer System
Example of hierarchical organization of computer cluster:
N = 12; L = 3; n23 = 2; C23 = {9, 10, 11, 12}; g(3, 3, 4) = 2; z(1, 7) = 1
Notation:
C = {1, 2, …, N} – the set of processor cores;
L – the number of levels in the communication network;
nl – the number of elements at level l ∈ {1, 2, …, L};
nlk – the number of children of element k ∈ {1, 2, …, nl} at level l;
Clk – the set of processor cores that are descendants of element k at level l; clk = |Clk|;
bl – the bandwidth of the communication channels at level l (bit/sec.).
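The tree model above lends itself to a direct sketch. Here z(p, q), the deepest level at which cores p and q share an ancestor element, selects the bandwidth b_z(p,q) that limits their communication; the data layout and the bandwidth numbers are my assumptions, not the authors' implementation.

```python
# Minimal sketch of the hierarchy model (data layout is an assumption).
# ancestors[c] lists the element containing core c at each level,
# from the root (level 1) downward.

def z(ancestors, p, q):
    """Deepest level at which cores p and q share an ancestor element."""
    level = 0
    for ep, eq in zip(ancestors[p], ancestors[q]):
        if ep != eq:
            break
        level += 1
    return level

# Toy two-level cluster: one switch (level 1) over 2 SMP nodes (level 2),
# 2 cores per node.
ancestors = {
    1: (1, 1), 2: (1, 1),   # cores 1-2 on node 1
    3: (1, 2), 4: (1, 2),   # cores 3-4 on node 2
}
b = {1: 1e9, 2: 25.6e9}     # bandwidth per level, bit/sec (assumed values)

# same node -> limited by the level-2 (intra-node) bandwidth
assert z(ancestors, 1, 2) == 2
# different nodes -> limited by the level-1 (switch) bandwidth
assert z(ancestors, 1, 3) == 1
```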
Given a task graph G = (V, E) and a description of the hierarchical organization of the computer system (CS):
• V = {1, 2, …, M} – the set of parallel processes;
• E ⊆ V × V – the set of inter-process communications;
• dij – the volume of data transmitted between processes i and j over the program execution time;
• bz(p, q) – the bandwidth of the communication channel between cores p and q.
A mapping is a function f : V → C, defined by the values of xij.
The objective is to minimize the program execution time T(X).
The Problem of Mapping Parallel Programs into Hierarchical
Distributed Computer Systems
T(X) = max_{i ∈ V} Σ_{j=1}^{M} Σ_{p=1}^{N} Σ_{q=1}^{N} x_ip · x_jq · d_ij / b_z(p,q) → min

Subject to the constraints:

Σ_{j=1}^{N} x_ij = 1,  i = 1, 2, …, M;
Σ_{i=1}^{M} x_ij ≤ 1,  j = 1, 2, …, N;
x_ij ∈ {0, 1},  i ∈ V, j ∈ C,

where x_ij = 1 if f(i) = j, and x_ij = 0 otherwise.
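For a fixed mapping the inner sums over p and q collapse, because each process occupies exactly one core. A minimal sketch of evaluating the objective on a toy instance (helper names and the instance are mine, not the authors' code):

```python
# Sketch: bottleneck communication time T(X) for a given mapping f.

def T(f, d, bz):
    """f[i]: core of process i; d[i][j]: data volume between i and j;
    bz(p, q): bandwidth between cores p and q."""
    return max(
        sum(vol / bz(f[i], f[j]) for j, vol in d[i].items())
        for i in d
    )

# Toy instance: 4 processes in a bidirectional ring, 2 nodes x 2 cores.
node = {1: 0, 2: 0, 3: 1, 4: 1}                      # core -> node
bz = lambda p, q: 10.0 if node[p] == node[q] else 1.0
d = {1: {2: 5, 4: 5}, 2: {1: 5, 3: 5},
     3: {2: 5, 4: 5}, 4: {3: 5, 1: 5}}

f_good = {1: 1, 2: 2, 3: 3, 4: 4}   # ring neighbours share nodes
f_bad  = {1: 1, 2: 3, 3: 2, 4: 4}   # every ring edge crosses nodes
print(T(f_good, d, bz), T(f_bad, d, bz))   # 5.5 vs 10.0
```

The good mapping keeps one of each process's two ring edges inside a node, so its bottleneck time is lower.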
Task graph partitioning:
The Heuristic Algorithm TMMGP
[Figure: the task graph is partitioned into subsets V′1, V′2, V′3 following the channel bandwidths b1, b2, b3 of the hierarchy levels]
k = M / cL1
Step 1 – Partitioning
Step 2 – Mapping
Task Graph Partitioning in the TMMGP algorithm
1. Coarsen the graph: Heavy Edge Matching (Karypis and Kumar, 1998).
2. Partition the coarse graph Gm into k subsets by recursive bisection (Schloegel et al., 2003).
3. Refine the partition by the FM heuristic (Fiduccia and Mattheyses, 1982).
The computational complexity of the TMMGP algorithm is O(|E| log2 k + M).
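Step 3 can be illustrated with a simplified single-pass, gain-driven refinement in the spirit of the FM heuristic (unit vertex sizes, no bucket lists or move rollback; the data layout and toy graph are my assumptions):

```python
# Simplified FM-style refinement: repeatedly move the unlocked vertex with
# the highest cut-reduction gain, respecting a balance bound.

def fm_pass(adj, side, max_size):
    """adj[v]: dict neighbour -> edge weight; side[v]: 0 or 1."""
    locked = set()
    moved = True
    while moved:
        moved = False
        best, best_gain = None, 0
        for v in adj:
            if v in locked:
                continue
            # gain = cut weight removed minus cut weight created by moving v
            gain = sum(w if side[u] != side[v] else -w
                       for u, w in adj[v].items())
            target = 1 - side[v]
            if gain > best_gain and \
               sum(1 for u in side if side[u] == target) < max_size:
                best, best_gain = v, gain
        if best is not None:
            side[best] = 1 - side[best]
            locked.add(best)            # each vertex moves at most once
            moved = True
    return side

def cut_weight(adj, side):
    return sum(w for v in adj for u, w in adj[v].items()
               if side[u] != side[v]) // 2

# Toy path graph 1-2-3-4; the heavy edges (weight 5) should not stay cut.
adj = {1: {2: 5}, 2: {1: 5, 3: 1}, 3: {2: 1, 4: 5}, 4: {3: 5}}
side = {1: 0, 2: 1, 3: 0, 4: 1}         # bad initial bisection: cut = 11
side = fm_pass(adj, side, max_size=3)
print(side, cut_weight(adj, side))      # cut drops to 1
```

The full Fiduccia–Mattheyses algorithm also accepts temporarily negative-gain moves and rolls back to the best prefix of the move sequence; this sketch stops at the first pass with no positive-gain move.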
Software Tools for Mapping MPI Programs
Experimental Setup
MPI programs:
• NAS Parallel Benchmarks (NPB);
• High-Performance Linpack (HPL).
Computer clusters:
• Cluster Xeon16: 4 nodes (2 x Intel Xeon 5150), interconnect: Gigabit/Fast Ethernet;
• Cluster Opteron10: 5 nodes (2 x AMD Opteron 248), interconnect: Gigabit/Fast Ethernet.
HPL task graph:
16 processes, PMAP=0, BCAST=5
NPB Conjugate Gradient task graph:
16 processes, CLASS B
NPB Multigrid task graph:
16 processes, CLASS B
Experiment Results
The execution time of the TMMGP algorithm on an Intel Core 2 Duo 2.13 GHz processor is less than 1 sec.
The execution time of MPI benchmarks on the Xeon16 cluster:

Benchmark                  Interconnect       T(X_RR), sec.   T(X_TMMGP), sec.   Speedup
High-Performance Linpack   Fast Ethernet      1108.69         911.81             1.22
High-Performance Linpack   Gigabit Ethernet   263.15          231.72             1.14
NPB Conjugate Gradient     Fast Ethernet      726.02          400.36             1.81
NPB Conjugate Gradient     Gigabit Ethernet   97.56           42.05              2.32
NPB Multigrid              Fast Ethernet      23.94           23.90              1.00
NPB Multigrid              Gigabit Ethernet   4.06            4.03               1.00

• T(X_RR) – the execution time of the MPI benchmark with mapping by the round-robin algorithm of the mpiexec tool (MPICH2 1.0.6).
• T(X_TMMGP) – the execution time of the MPI benchmark with mapping by the TMMGP algorithm.
• Speedup = T(X_RR) / T(X_TMMGP).
Conclusions and Future Works
Conclusions
• Mapping algorithms must take into account the hierarchical organization of modern computer systems and the structure of parallel programs.
• The proposed TMMGP algorithm reduces the execution time of MPI programs by 40% on average.
• New algorithms for mapping parallel programs with full task graphs are required.
Future Work
• Development of new algorithms for mapping parallel programs into arbitrary subsystems of hierarchical distributed computer systems.
• Integration of the TMMGP mapping algorithm with the mpiexec tool and resource management systems (such as TORQUE).
• Application of the described approach to optimizing MPI collective operations.
Thank You For Your Attention
Backup Slides
The k-way Graph Partitioning Problem
The example of 3-way graph partitioning:
V′ = {1, 2, …, 12}; k = 3; s = 3;
W(1, 2) = 3; W(1, 3) = 2; W(2, 3) = 4.
It is required to partition the graph G′ = (V′, E′) into k disjoint subsets V′1, V′2, …, V′k such that the maximal sum of edge weights incident to any subset is minimized and |V′i| ≤ s.
• w(u, v) – the weight of edge (u, v) ∈ E′;
• W(i, j) – an additional weight for edges incident to subsets i and j;
• c(u, v, i, j) = w(u, v) · W(i, j) – the total weight of edge (u, v) incident to subsets i and j.
The approximate partition:
edge-weights(V′1) = w(1, 5)·W(1, 2) + w(6, 8)·W(1, 3) + w(2, 3)·W(1, 3) = 22;
edge-weights(V′2) = 40;
edge-weights(V′3) = 38.
For the task-graph partitioning the edge weights are computed as c(u, v, i, j) = d_uv / b_g(L, i, j).
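The subset cost defined above (edge weight times subset-pair weight, summed over incident cut edges) can be sketched as follows; helper names and the toy instance are mine:

```python
# Sketch: per-subset cost of a k-way partition under pair weights W(i, j).

def subset_costs(edges, part, W):
    """edges: (u, v, w) triples; part[v]: subset of vertex v;
    W[(i, j)]: additional weight for the subset pair i < j."""
    cost = {}
    for u, v, w in edges:
        i, j = part[u], part[v]
        if i == j:
            continue                      # internal edge: no cost
        c = w * W[(min(i, j), max(i, j))]
        cost[i] = cost.get(i, 0) + c      # edge is incident to both subsets
        cost[j] = cost.get(j, 0) + c
    return cost

# Toy: 3 vertices, 2 subsets, W(1, 2) = 3.
edges = [(1, 2, 5), (2, 3, 1), (3, 1, 2)]
part = {1: 1, 2: 1, 3: 2}
W = {(1, 2): 3}
print(subset_costs(edges, part, W))       # {1: 9, 2: 9}
```

The partition quality is then the maximum of these per-subset costs, which is what the problem statement minimizes.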
Heavy Edge Matching algorithm
[Figure: Heavy Edge Matching (HEM) – a matching on the 16-vertex source graph and the resulting coarser graph]
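The matching step can be sketched as follows (a simplified version with my own data layout): visit the vertices in random order and match each unmatched vertex with its unmatched neighbour across the heaviest edge.

```python
import random

def heavy_edge_matching(adj, seed=0):
    """adj[v]: dict neighbour -> edge weight. Returns matched vertex pairs;
    each pair is contracted into one vertex of the coarser graph."""
    rng = random.Random(seed)
    order = list(adj)
    rng.shuffle(order)                    # random visiting order
    matched, pairs = set(), []
    for v in order:
        if v in matched:
            continue
        candidates = [(w, u) for u, w in adj[v].items() if u not in matched]
        if candidates:
            _, u = max(candidates)        # heaviest incident free edge
            matched.update((v, u))
            pairs.append((v, u))
        else:
            matched.add(v)                # v stays a singleton
    return pairs

# Toy 4-cycle with heavy edges 1-2 (weight 9) and 3-4 (weight 8):
# both heavy edges get contracted, whatever the visiting order.
adj = {1: {2: 9, 3: 2}, 2: {1: 9, 4: 1},
       3: {1: 2, 4: 8}, 4: {2: 1, 3: 8}}
pairs = heavy_edge_matching(adj)
```

Contracting heavy edges keeps as much communication volume as possible inside coarse vertices, so the later partitioning cuts mostly light edges.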
Graph Bisection
[Figure: bisection of a weighted graph grown from an initial vertex; the numbers are edge weights]