MLD2P4: a package of parallel
algebraic multilevel Preconditioners
Pasqua D’Ambra, Institute for High-Performance Computing and Networking (ICAR-CNR), Naples Branch, Italy
Bologna, March 2008
Joint work with Daniela di Serafino, Second University of Naples, and Salvatore Filippone, University of Rome “Tor Vergata”
Pasqua D'Ambra - Bologna March 2008
Overview
- Motivations
  - Background
  - Objectives
- MLD2P4: Multi-Level Domain Decomposition Parallel Preconditioners Package based on PSBLAS
  - Algorithms and computational kernels
  - Software architecture
- Some Results & Applications
Background
Large-scale applications have to solve
$Ax = b$
The linear system matrix is:
- real or complex, and square
- large and sparse
- distributed among parallel processors
Matrix dimensions and entries, conditioning, sparsity pattern and coupling among variables vary along simulations
Background (cont’d)
- What is the best method/preconditioner?
  - No absolute winner; experimentation is needed
  - Reliable preconditioners require access to the complete matrix
  - Parallel implementation is not trivial
- Interfacing with application software is required
  - Custom-made interfaces to parallel legacy codes
  - Different interfaces for different preconditioners/solvers
Objectives
Designing and implementing a suite of algebraic preconditioners based on Linear Algebra kernels for parallel sparse matrix computations
- Flexibility: different preconditioners through a single API
- Portability & Efficiency: standard base software for serial kernels and data communications
- Simplicity of usage: modern (OO) Fortran 95 features and auxiliary routines for smooth legacy code integration
MLD2P4
Multi-Level Domain Decomposition Parallel Preconditioners Package based on PSBLAS
(PSBLAS: Parallel Sparse Basic Linear Algebra Subprograms)
- Diagonal
- Block-Jacobi
- Additive Schwarz with arbitrary overlap
- Algebraic multi-level Schwarz
mld_prec_build(A,M,…)
  A: distributed sparse matrix (input)
  M: distributed sparse preconditioner (output)
mld_prec_apply(M,x,y,…)
  M: distributed sparse preconditioner (input)
  x, y: distributed vectors (input/output)
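The build/apply split can be illustrated outside Fortran. The Python sketch below is not the MLD2P4 interface: `prec_build`, `prec_apply`, `solve_dense`, the `kind` strings and the dense list-of-lists storage are hypothetical, chosen only to show how one pair of routines can serve different preconditioners (here diagonal and block-Jacobi) on a small serial example.

```python
def solve_dense(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting (A is not modified)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * y[k] for k in range(r + 1, n))
        y[r] = (M[r][n] - s) / M[r][r]
    return y


def prec_build(A, kind, blocks=None):
    """Build phase: extract from A whatever the chosen preconditioner needs."""
    if kind == "diag":
        return {"kind": kind, "inv_diag": [1.0 / A[i][i] for i in range(len(A))]}
    if kind == "bjac":
        local = [[[A[i][j] for j in blk] for i in blk] for blk in blocks]
        return {"kind": kind, "blocks": blocks, "local": local}
    raise ValueError("unknown preconditioner kind: " + kind)


def prec_apply(M, x):
    """Apply phase: return y = M^{-1} x, whatever M was built as."""
    if M["kind"] == "diag":
        return [d * xi for d, xi in zip(M["inv_diag"], x)]
    y = [0.0] * len(x)
    for blk, A_blk in zip(M["blocks"], M["local"]):
        y_blk = solve_dense(A_blk, [x[i] for i in blk])
        for i, val in zip(blk, y_blk):
            y[i] = val
    return y


A = [[4.0, -1.0, 0.0, 0.0],
     [-1.0, 4.0, -1.0, 0.0],
     [0.0, -1.0, 4.0, -1.0],
     [0.0, 0.0, -1.0, 4.0]]
x = [1.0, 1.0, 1.0, 1.0]

y_diag = prec_apply(prec_build(A, "diag"), x)
y_bjac = prec_apply(prec_build(A, "bjac", blocks=[[0, 1], [2, 3]]), x)
print(y_diag, y_bjac)
```

The caller only changes the `kind` argument; the solver loop that calls `prec_apply` stays the same, which is the point of a single API.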
PSBLAS (Filippone et al., http://www.ce.uniroma2.it/psblas/)
Basic Linear Algebra Operations with Sparse Matrices on MIMD Architectures
[Diagram: layered software architecture, with the labels F95 and F77]
- Appl.: Iterative Sparse Linear Solvers (CG, BiCG, CGS, BiCGSTAB, RGMRES, …)
- Kernels: Parallel Sparse Matrix Operations (matrix-matrix products, matrix-vector products, …); Parallel Sparse Matrix Management (allocate, build, update, …)
- Base sw: MPI; BLACS (Basic Linear Algebra Communication Subprograms); SBLAS (Duff et al.)
MLD2P4 Design: Algorithms
Algebraic multi-level Schwarz preconditioners based on smoothed aggregation
- good trade-off between parallelism and convergence
- optimal scalability for symmetric positive-definite matrices
- algebraic framework allows general-purpose application
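(1-lev) Schwarz: basic ingredients
$A$: $n \times n$ matrix with symmetric sparsity pattern
Adjacency graph of $A$: $G = (W, E)$, with $W = \{1, 2, 3, \dots, n\}$ and $E = \{(i,j) : a_{ij} \neq 0\}$
0-overlap partition of $W$: $\{W_i^0\}$, $i = 1, \dots, m$, partition of $W$
$\delta$-overlap partition of $W$: $W_i^\delta \supset W_i^{\delta-1}$, with $W_i^\delta = W_i^{\delta-1} \cup \{ j \in W : \exists\, k \in W_i^{\delta-1},\ (j,k) \in E \}$
[Figure: 9 x 9 example (rows and columns 1 to 9) showing the subsets $W_1^0$, $W_2^0$, $W_1^1$ and $W_2^1$]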
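The overlap construction amounts to repeatedly adding graph neighbours to each subset. A minimal sketch follows, assuming a 1D chain of 9 vertices and a 0-overlap partition into {1..5} and {6..9}; both are illustrative choices, not necessarily the example drawn on the slide, and `extend_overlap` is a hypothetical helper name.

```python
def extend_overlap(adj, subsets, delta):
    """Return the delta-overlap subsets: each pass adds every vertex
    adjacent to a vertex already in the subset."""
    result = [set(s) for s in subsets]
    for _ in range(delta):
        result = [s | {j for k in s for j in adj[k]} for s in result]
    return result


# Assumed graph: vertex i is adjacent to i-1 and i+1 (tridiagonal sparsity pattern).
n = 9
adj = {i: {j for j in (i - 1, i + 1) if 1 <= j <= n} for i in range(1, n + 1)}

W0 = [{1, 2, 3, 4, 5}, {6, 7, 8, 9}]  # assumed 0-overlap partition of W
W1 = extend_overlap(adj, W0, 1)       # 1-overlap partition
W2 = extend_overlap(adj, W0, 2)       # 2-overlap partition

print([sorted(s) for s in W1])
print([sorted(s) for s in W2])
```

Each extra unit of overlap widens the interface region by one layer of vertices, which is why larger overlap increases the local work and the data exchanged between neighbouring subdomains.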
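AS: basic ingredients (cont’d)
Restriction/prolongation operators:
$R_i^\delta = [e_{j_1}, e_{j_2}, \dots, e_{j_{n_i^\delta}}]^T$, $j_k \in W_i^\delta$
$P_i^\delta = (R_i^\delta)^T$
Restriction of $A$:
$A_i^\delta = R_i^\delta A (R_i^\delta)^T$
[Figure: 9 x 9 example matrix (rows and columns 1 to 9) with the submatrices $A_1^1$ and $A_2^1$]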
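Because each restriction operator is a selection of rows of the identity, the triple product R A R^T simply extracts the rows and columns of A indexed by the subset. The sketch below checks this on a small dense example; the 5 x 5 tridiagonal matrix, the 0-based index set and the helper names are assumptions made for illustration.

```python
def restriction_matrix(indices, n):
    """Rows of the n x n identity selected by `indices` (0-based)."""
    return [[1.0 if col == idx else 0.0 for col in range(n)] for idx in indices]


def transpose(X):
    return [list(col) for col in zip(*X)]


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


n = 5
# Assumed test matrix: tridiag(-1, 2, -1).
A = [[2.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]

W_1 = [0, 1, 2]                      # assumed index set of one subdomain
R = restriction_matrix(W_1, n)       # restriction operator
P = transpose(R)                     # prolongation operator P = R^T

A_1 = matmul(matmul(R, A), P)                       # R A R^T
A_1_direct = [[A[i][j] for j in W_1] for i in W_1]  # plain submatrix extraction

print(A_1 == A_1_direct)
```

In practice the operators are never formed explicitly; a sparse implementation would use index lists, as in the second computation.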
Coarse level correction: basic ingredients
Algebraic coarsening: uncoupled aggregation
Smoothed prol./restr. operators:
$P_C = (I - \omega D^{-1} A)\,\tilde{P}$, $R_C = P_C^T$
where $\tilde{P} : W_C \to W$,
$\tilde{P}_{ij} = 1$ if (vert. $i$) $\in$ (aggr. $j$), $0$ otherwise
Coarse-level matrix:
$A_C = P_C^T A P_C = R_C A R_C^T$
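The coarse-level ingredients can be sketched on a small dense example. The code assumes the usual smoothed-aggregation form (a piecewise-constant tentative prolongator, one damped-Jacobi smoothing step, Galerkin coarse matrix); the test matrix, the two aggregates and the damping value 2/3 are illustrative assumptions, and the aggregation itself is given by hand rather than computed.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def tentative_prolongator(aggr_of, n_aggr):
    """Entry (i, j) is 1 if vertex i belongs to aggregate j, 0 otherwise."""
    return [[1.0 if aggr_of[i] == j else 0.0 for j in range(n_aggr)]
            for i in range(len(aggr_of))]


def smoothed_prolongator(A, P_tilde, omega):
    """P_C = (I - omega * D^{-1} A) * P_tilde, with D = diag(A)."""
    n = len(A)
    S = [[(1.0 if i == j else 0.0) - omega * A[i][j] / A[i][i]
          for j in range(n)] for i in range(n)]
    return matmul(S, P_tilde)


def coarse_matrix(A, P_C):
    """Galerkin coarse matrix A_C = P_C^T A P_C."""
    return matmul(transpose(P_C), matmul(A, P_C))


n = 6
# Assumed test matrix: tridiag(-1, 2, -1).
A = [[2.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
aggr_of = [0, 0, 0, 1, 1, 1]          # assumed aggregates {0,1,2} and {3,4,5}

P_tilde = tentative_prolongator(aggr_of, 2)
A_C_plain = coarse_matrix(A, smoothed_prolongator(A, P_tilde, 0.0))
A_C_smooth = coarse_matrix(A, smoothed_prolongator(A, P_tilde, 2.0 / 3.0))

print(A_C_plain)
print(A_C_smooth)
```

With a zero damping parameter the smoothed prolongator reduces to the tentative one, so the first coarse matrix is just the sum of the entries of A over each pair of aggregates.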
Multilevel-Schwarz preconditioners & computational kernels
Example: 2-lev hybrid-post
$M_{2LH}^{-1} = M_{1L}^{-1} + (I - M_{1L}^{-1} A)\, M_C^{-1}$
build:
- build: $A_i^\delta$
- aggregate: $\tilde{P} : W_C \to W$
- mat x mat: $R_C^T = P_C = (I - \omega D^{-1} A)\,\tilde{P}$
- mat x mat: $A_C = R_C A R_C^T$
apply:
- restrict: $z = R_C v$
- solve: $A_C y = z$
- prol: $w = R_C^T y$
- mat-vec: $x = v - A w$
- AS prec: $w = w + M_{1L}^{-1} x$
P. D’Ambra, D. di Serafino, S. Filippone, On the Development of PSBLAS-based Parallel Two-level Schwarz Preconditioners, Applied Numerical Mathematics, 57, 2007.
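The apply phase of a two-level hybrid preconditioner can be sketched as a short sequence of kernels. The code below assumes the post-smoothing ordering (coarse correction first, then a one-level solve on the residual); it uses block Jacobi as a stand-in for the one-level Schwarz preconditioner and an unsmoothed piecewise-constant prolongator, and all function names are hypothetical.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def solve_dense(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting (A is not modified)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        y[r] = (M[r][n] - sum(M[r][k] * y[k] for k in range(r + 1, n))) / M[r][r]
    return y


def block_jacobi_solve(A, blocks, r):
    """Stand-in one-level preconditioner: exact solves on non-overlapping diagonal blocks."""
    s = [0.0] * len(r)
    for blk in blocks:
        A_blk = [[A[i][j] for j in blk] for i in blk]
        for i, val in zip(blk, solve_dense(A_blk, [r[i] for i in blk])):
            s[i] = val
    return s


def hybrid_post_apply(A, v, one_level_solve, P, A_C):
    """Coarse correction followed by one-level smoothing of the residual."""
    n, n_c = len(P), len(P[0])
    z = [sum(P[i][j] * v[i] for i in range(n)) for j in range(n_c)]  # restrict
    y = solve_dense(A_C, z)                                          # coarse solve
    w = matvec(P, y)                                                 # prolong
    x = [vi - awi for vi, awi in zip(v, matvec(A, w))]               # residual
    s = one_level_solve(x)                                           # one-level prec.
    return [wi + si for wi, si in zip(w, s)]


n = 4
A = [[2.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
P = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # assumed aggregates {0,1}, {2,3}
A_C = [[sum(P[i][a] * A[i][j] * P[j][b] for i in range(n) for j in range(n))
        for b in range(2)] for a in range(2)]
v = [1.0, 1.0, 1.0, 1.0]

w = hybrid_post_apply(A, v, lambda r: block_jacobi_solve(A, [[0, 1], [2, 3]], r), P, A_C)
print(w)
```

A quick sanity check of the composition: if the one-level solve is replaced by an exact solve with A, the result equals the exact solution of A w = v, whatever the coarse correction is.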
MLD2P4 Design: Software Architecture
[Diagram: layered software architecture]
- Appl.: Parallel Preconditioners (BJA, ASM, RAS, ASH, ml-additive, ml-hybridpre, ml-hybridpost, ml-symmhybrid)
- Kernels: Preconditioner Build (prolongation, restriction, coarse matrix, local sparse ILU and LU); Preconditioner Application (distributed & serial coarse matrix solvers)
- Base sw: PSBLAS 2.0 (extended version of PSBLAS 1.0)
Performance Results & Comparisons
- Different test matrices from various sources
  - thm matrices: thermal diffusion in solids
  - kivap matrices: automotive engine design
  - shipsec matrices: from UF sparse matrix collection
- Experiments carried out on different Linux clusters
  - 64 Intel Itanium dual-processor nodes connected by Quadrics QSNetII Elan 4
  - 32 AMD Opteron dual-processor nodes connected by Myrinet
  - 8 AMD Opteron dual-processor nodes connected by InfiniBand
  - 8 Intel Itanium dual-processor nodes connected by Myrinet
  - 16 Intel Pentium IV nodes connected by Fast Ethernet
- Comparison with up-to-date related work
  - Trilinos-ML
A. Buttari, P. D’Ambra, D. di Serafino, S. Filippone, 2LEV-D2P4: a package of high-performance preconditioners for scientific and engineering applications, Applicable Algebra in Engineering, Communication and Computing, Vol. 18, 2007.
Experimental Setting
- MLD2P4: right-preconditioned BiCGSTAB
  - 1-lev Restricted Additive Schwarz preconditioner with ILU(0) (RAS)
  - 2-lev hybrid Schwarz preconditioner, with RAS/ILU(0) as 1-lev prec. Distributed coarsest matrix: 4 sweeps of block Jacobi with ILU(0) (2LDI) or with UMFPACK (2LDU) on diagonal blocks
  - 3-lev hybrid Schwarz preconditioner, with RAS/ILU(0) as 1-lev prec. Distributed coarsest matrix: 4 sweeps of block Jacobi with ILU(0) (3LDI) or with UMFPACK (3LDU) on diagonal blocks
- Stopping criterion: $\|r_k\| / \|r_0\| \le 10^{-6}$ or maxit
- Unit right-hand side and null starting guess
- Row-block distribution of matrices: # submatrices = # procs
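A row-block distribution with one submatrix per process can be sketched as follows. The slides do not say how rows are assigned, so contiguous blocks of nearly equal size are an assumption here, and `row_block_ranges` is a hypothetical helper.

```python
def row_block_ranges(n_rows, n_procs):
    """Split rows 0..n_rows-1 into n_procs contiguous blocks of nearly equal size.
    Returns a list of half-open (start, end) ranges, one per process."""
    base, extra = divmod(n_rows, n_procs)
    ranges = []
    start = 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)  # first `extra` processes get one more row
        ranges.append((start, start + size))
        start += size
    return ranges


# Matrix size taken from the thm1 test case, distributed over 64 processes.
ranges = row_block_ranges(600000, 64)
print(ranges[0], ranges[-1])

# A case where the rows do not divide evenly.
small = row_block_ranges(10, 3)
print(small)
```

With this layout each process owns one block of rows, so the number of subdomains seen by the Schwarz preconditioner equals the number of processes.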
thm matrices: number of iterations
thm1: n = 600000, nnz = 2996800
64 Intel Itanium dual-processor nodes connected by QSNetII

OV=0
np    RAS   2LDI  2LDU  3LDI  3LDU
1     613   190   -     70    -
2     705   184   -     72    -
4     761   206   -     74    -
8     688   202   44    67    28
16    748   211   61    70    36
32    766   186   81    69    51
64    809   196   113   86    68

OV=1
np    RAS   2LDI  2LDU  3LDI  3LDU
1     613   190   -     70    -
2     923   183   -     76    -
4     684   178   -     63    -
8     937   191   34    62    27
16    688   172   57    68    33
32    714   181   74    65    45
64    720   180   107   77    62
thm matrices: execution times and speed-ups (OV=1; best execution times: 3LDU)
[Figure: execution times and speed-ups on 64 Intel Itanium dual-processor nodes connected by QSNetII]
Application test case
- large eddy simulation of incompressible turbulent flows in a bi-periodical channel
- main computational kernel: nonsymmetric and singular linear systems arising from elliptic PDE with Neumann b.c.
A. Aprovitola, P. D’Ambra, F. M. Denaro, D. di Serafino, S. Filippone, Application of Parallel Algebraic Multilevel Domain Decomposition Preconditioners in Large-Eddy Simulations of Wall-bounded Turbulent Flows: First Experiments, RT-ICAR-NA-2007-02, July 2007.
Experimental Setting
- MLD2P4: right-preconditioned RGMRES(30)
  - 1-lev Restricted Additive Schwarz preconditioner with ILU(0) (RAS)
  - 2-lev/3-lev hybrid Schwarz preconditioner, with RAS/ILU(0) as 1-lev prec. Distributed coarse matrix: 4 sweeps of block Jacobi with ILU(0) (2LDI/3LDI) on diagonal blocks
- Stopping criterion: $\|r_k\| / \|r_0\| \le 10^{-7}$ or maxit
- General row-block distribution
- Pressure linear system: n = 201600, nnz = 1398600
- Reynolds number: 180
- Computational grid: 140x32x45, non-uniform in the y direction; time-step $10^{-4}$
LES of incompressible wall-bounded flow
[Figure: results on 16 Intel Itanium dual-processor nodes connected by QSNetII; panel annotations: "SOR on 1 proc. = 9 sec." and "SOR on 1 proc. = 8580 sec."]
Work in progress
- Package available on the web very soon
- More sophisticated aggregation algorithms
- Integration of preconditioners and solvers in large-scale applications