Massively parallel implementation of Total-FETI DDM with application to medical image registration
Massively parallel implementation of Total-FETI DDM with application to medical image registration
Michal Merta, Alena Vašatová, Václav Hapla, David Horák
DD21, Rennes, France
solution of large-scale scientific and engineering problems with possibly hundreds of millions of DOFs; both linear and non-linear problems
non-overlapping FETI methods with up to tens of thousands of subdomains
usage of PRACE Tier-1 and Tier-0 HPC systems
Motivation
developed by Argonne National Laboratory
data structures and routines for the scalable parallel solution of scientific applications modeled by PDEs
coded primarily in the C language, but with good Fortran support; can also be called from C++ and Python codes
current version is 3.2, www.mcs.anl.gov/petsc
petsc-dev (the development branch) is intensively evolving
code and mailing lists open to anybody
PETSc (Portable, Extensible Toolkit for Scientific Computation)
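As a minimal, hedged illustration (not from the slides; the 1D Laplacian is only a placeholder problem, and the code targets a recent PETSc API rather than version 3.2), a PETSc program that assembles a distributed matrix and solves it with the built-in CG might look like:

```cpp
// Illustrative sketch only, not the authors' code: assemble a distributed
// 1D Laplacian and solve A x = b with PETSc's KSP (conjugate gradients).
#include <petscksp.h>

int main(int argc, char **argv)
{
  PetscInitialize(&argc, &argv, NULL, NULL);

  const PetscInt n = 100;                 /* global size, placeholder problem */
  Mat A; Vec x, b; KSP ksp;
  PetscInt rstart, rend;

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetFromOptions(A);
  MatSetUp(A);
  MatGetOwnershipRange(A, &rstart, &rend);
  for (PetscInt i = rstart; i < rend; ++i) {   /* each rank assembles its own rows */
    if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
    if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
    MatSetValue(A, i, i, 2.0, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  VecCreate(PETSC_COMM_WORLD, &b);
  VecSetSizes(b, PETSC_DECIDE, n);
  VecSetFromOptions(b);
  VecDuplicate(b, &x);
  VecSet(b, 1.0);

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetType(ksp, KSPCG);                 /* built-in CG, as used in the benchmark */
  KSPSetTolerances(ksp, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT);
  KSPSetFromOptions(ksp);
  KSPSolve(ksp, b, x);

  KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
  PetscFinalize();
  return 0;
}
```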
PETSc components (diagram of the sequential and parallel components of the library)
developed by Sandia National Laboratories
collection of relatively independent packages
toolkit for basic linear algebra operations, direct and iterative solvers for linear systems, PDE discretization utilities, mesh generation tools, etc.
object-oriented design, high modularity, use of modern C++ features (templating)
written mainly in C++ (Fortran and Python bindings)
current version 10.10, trilinos.sandia.gov
Trilinos
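A comparable hedged sketch with Trilinos (illustrative only, not the authors' code; it uses the Epetra and AztecOO packages for brevity, whereas the benchmark below uses the Belos solver package):

```cpp
// Illustrative sketch only: assemble a distributed 1D Laplacian with Epetra
// and solve it with AztecOO's conjugate gradients.
#include <mpi.h>
#include <Epetra_MpiComm.h>
#include <Epetra_Map.h>
#include <Epetra_CrsMatrix.h>
#include <Epetra_Vector.h>
#include <Epetra_LinearProblem.h>
#include <AztecOO.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  Epetra_MpiComm comm(MPI_COMM_WORLD);

  const int n = 100;                         // global size, placeholder problem
  Epetra_Map map(n, 0, comm);                // row distribution over MPI ranks
  Epetra_CrsMatrix A(Copy, map, 3);          // at most 3 nonzeros per row

  for (int k = 0; k < map.NumMyElements(); ++k) {   // each rank fills its own rows
    int i = map.GID(k);
    double vals[3]; int cols[3]; int nnz = 0;
    if (i > 0)     { vals[nnz] = -1.0; cols[nnz++] = i - 1; }
    vals[nnz] = 2.0; cols[nnz++] = i;
    if (i < n - 1) { vals[nnz] = -1.0; cols[nnz++] = i + 1; }
    A.InsertGlobalValues(i, nnz, vals, cols);
  }
  A.FillComplete();

  Epetra_Vector x(map), b(map);
  b.PutScalar(1.0);

  Epetra_LinearProblem problem(&A, &x, &b);
  AztecOO solver(problem);
  solver.SetAztecOption(AZ_solver, AZ_cg);   // conjugate gradients
  solver.SetAztecOption(AZ_precond, AZ_none);
  solver.Iterate(1000, 1e-5);                // max iterations, relative tolerance

  MPI_Finalize();
  return 0;
}
```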
Trilinos components
are parallelized on the data level (vectors & matrices) using MPI
use BLAS and LAPACK, the de facto standard for dense LA
have their own implementations of sparse BLAS
include robust preconditioners, linear solvers (direct and iterative) and nonlinear solvers
can cooperate with many other external solvers and libraries (e.g. MATLAB, MUMPS, UMFPACK, …)
support CUDA and hybrid parallelization
are licensed as open-source
Both PETSc and Trilinos…
Problem of elastostatics
Ω ... isotropic elastic body
Γ_U ... boundary with prescribed displacements
Γ_F ... boundary with prescribed surface traction
f ... body loads
TFETI decomposition
Γ^{pq} ... artificial boundaries between subdomains Ω_p and Ω_q, with prescribed gluing conditions enforced by Lagrange multipliers
(figure: decomposition into subdomains with gluing boundaries Γ^{12}, Γ^{13}, Γ^{24}, Γ^{34})
The FEM discretization with a suitable numbering of nodes results in the QP problem:
Primal discretized formulation
$$\min_{u}\ \tfrac12\, u^{T} K u - f^{T} u \quad\text{s.t.}\quad B u = c$$
where $K = \mathrm{diag}(K_1, \dots, K_{N_S}) \in \mathbb{R}^{n\times n}$ is a symmetric positive semidefinite (and so, in general, singular) block-diagonal global stiffness matrix, $K_s$ is the stiffness matrix of the subdomain $\Omega_s$, $B \in \mathbb{R}^{m\times n}$ is a full-rank constraint matrix, $c \in \mathbb{R}^{m}$ is the constraint RHS, and $f \in \mathbb{R}^{n}$ is a load vector.
Dual discretized formulation (homogenized)
$$\min_{\lambda}\ \tfrac12\, \lambda^{T} A \lambda - \lambda^{T} b \quad\text{s.t.}\quad G\lambda = o$$
where
$$F = B K^{+} B^{T}, \qquad d = B K^{+} f - c, \qquad G = R^{T} B^{T}, \qquad e = R^{T} f,$$
$$Q = G^{T} (G G^{T})^{-1} G, \qquad P = I - Q, \qquad A = P F P, \qquad \tilde{\lambda} = G^{T} (G G^{T})^{-1} e, \qquad b = P\,(d - F\tilde{\lambda}),$$
$K^{+}$ is a generalized inverse of $K$ (i.e. $K K^{+} K = K$), $R$ is a full-rank matrix such that $\mathrm{Im}\,R = \mathrm{Ker}\,K$, $\mathrm{Im}\,Q = \mathrm{Im}\,G^{T}$, and $\mathrm{Im}\,P = \mathrm{Ker}\,G$.
QP problem again, but with lower dimension and simpler constraints
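The slides jump from the primal QP straight to its (homogenized) dual; the standard FETI dualization behind this step, sketched here only for readability, is:

```latex
% Sketch of the standard FETI dualization (not verbatim from the slides).
% Lagrangian of the primal QP: L(u,\lambda) = 1/2 u^T K u - f^T u + \lambda^T (B u - c).
\begin{aligned}
  \partial_u L = 0:\;& K u = f - B^{T}\lambda
      \;\;\Rightarrow\;\; u = K^{+}(f - B^{T}\lambda) + R\alpha,\\
  \text{solvability:}\;& R^{T}(f - B^{T}\lambda) = o
      \;\;\Longleftrightarrow\;\; G\lambda = e,
      \qquad G = R^{T}B^{T},\; e = R^{T}f,\\
  \text{dual QP:}\;& \min_{\lambda}\ \tfrac12\,\lambda^{T} F \lambda - \lambda^{T} d
      \quad\text{s.t.}\quad G\lambda = e,
      \qquad F = B K^{+} B^{T},\; d = B K^{+} f - c.
\end{aligned}
% Homogenization: with \tilde\lambda = G^T (G G^T)^{-1} e (so G\tilde\lambda = e) and
% \lambda = \tilde\lambda + \lambda_0, G\lambda_0 = o, one arrives at the projected
% form with A = PFP and b = P(d - F\tilde\lambda) shown on the previous slide.
```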
Primal data distribution, F action
… straightforward matrix distribution, given by the decomposition
$$F\lambda = B K^{+} B^{T} \lambda$$
B ... very sparse
K ... block diagonal → the F action is embarrassingly parallel
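A hedged sketch of one F application on top of PETSc objects (illustrative only, not the authors' code; B, kspK and the work vectors are placeholder names, and in the real implementation B and the factored K are distributed by subdomains):

```cpp
// Sketch of one application y = F*lambda = B K^+ B^T lambda (illustrative placeholder).
// Assumes: Mat B (very sparse), KSP kspK realizing K^+ (e.g. LU of a regularized K),
// work vectors u, v in the primal space. All names are hypothetical.
#include <petscksp.h>

PetscErrorCode ApplyF(Mat B, KSP kspK, Vec lambda, Vec u, Vec v, Vec y)
{
  MatMultTranspose(B, lambda, u);  /* u = B^T lambda                               */
  KSPSolve(kspK, u, v);            /* v = K^+ u  (block diagonal: embarrassingly parallel) */
  MatMult(B, v, y);                /* y = B v                                      */
  return 0;
}
```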
Coarse projector action
$$P = I - Q, \qquad Q = G^{T} (G G^{T})^{-1} G$$
… can easily take 85 % of the computation time if not properly parallelized!
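A hedged sketch of one projector application with PETSc objects (illustrative only; G, kspGGt and the work vectors are placeholder names, and the coarse problem with G G^T is assumed to be prepared once in the preprocessing phase):

```cpp
// Sketch of one application y = P*x = x - G^T (G G^T)^{-1} G x (illustrative placeholder).
// Assumes: Mat G, KSP kspGGt solving the coarse problem with G G^T,
// work vectors gx, alpha in the coarse space. All names are hypothetical.
#include <petscksp.h>

PetscErrorCode ApplyP(Mat G, KSP kspGGt, Vec x, Vec gx, Vec alpha, Vec y)
{
  MatMult(G, x, gx);                 /* gx    = G x                                */
  KSPSolve(kspGGt, gx, alpha);       /* alpha = (G G^T)^{-1} G x  -- coarse problem */
  MatMultTranspose(G, alpha, y);     /* y     = G^T alpha                          */
  VecAYPX(y, -1.0, x);               /* y     = x - y                              */
  return 0;
}
```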
G preprocessing and action
preprocessing
action
Coarse problem preprocessing and action
preprocessing
action
(figure: coarse problem parallelization variants 1, 2, 3)
Currently used variant: B2 (PPAM 2011)
Coarse problem
the UK's largest, fastest and most powerful supercomputer
supplied by Cray Inc., operated by EPCC
uses the latest AMD "Bulldozer" multicore processor architecture
704 compute blades, each blade with 4 compute nodes, giving a total of 2816 compute nodes
each node with two 16-core AMD Opteron 2.3 GHz Interlagos processors → 32 cores per node, a total of 90 112 cores
each 16-core processor shares 16 GB of memory, 60 TB in total
theoretical peak performance over 800 Tflops
HECToR phase 3 (XE6)
www.hector.ac.uk
K^+ implemented as a direct solve (LU) of the regularized K
built-in CG routine used (PETSc KSP, Trilinos Belos)
E = 1e6, ν = 0.3, g = 9.81 m·s⁻²; computed @ HECToR
Benchmark
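A hedged sketch of how this setup might look on the PETSc side (illustrative only, written against a recent PETSc API; Kreg and PFP are placeholder names for the regularized subdomain stiffness matrix and a shell matrix applying A = PFP):

```cpp
// Illustrative placeholder, not the authors' code: K^+ realized as an LU solve of
// the regularized stiffness matrix, built-in CG selected for the projected dual system.
#include <petscksp.h>

PetscErrorCode SetupAndSolve(Mat Kreg, Mat PFP, Vec b, Vec lambda)
{
  KSP kspK, kspDual;
  PC  pcK;

  /* K^+ : direct (LU) solve with the regularized stiffness matrix */
  KSPCreate(PETSC_COMM_SELF, &kspK);     /* each subdomain owns and factorizes its block */
  KSPSetOperators(kspK, Kreg, Kreg);
  KSPSetType(kspK, KSPPREONLY);          /* no iteration, just apply the factorization  */
  KSPGetPC(kspK, &pcK);
  PCSetType(pcK, PCLU);
  KSPSetUp(kspK);                        /* triggers the LU factorization               */
  /* kspK would be passed to the F action sketched earlier; destroyed here for brevity  */

  /* dual problem: built-in CG, relative tolerance 1e-5, no preconditioning */
  KSPCreate(PETSC_COMM_WORLD, &kspDual);
  KSPSetOperators(kspDual, PFP, PFP);
  KSPSetType(kspDual, KSPCG);
  KSPSetTolerances(kspDual, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT);
  KSPSolve(kspDual, b, lambda);

  KSPDestroy(&kspK);
  KSPDestroy(&kspDual);
  return 0;
}
```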
Results
# subds = # cores              1         4         16         64        256        1024
Primal dim.               31 752   127 008    508 032  2 032 128  8 128 512  32 514 048
Dual dim.                    252     1 512      7 056     30 240    124 992     508 032
Solution time [s]  Trilinos    1.39      3.01      4.80       6.25      10.31      28.05
                   PETSc       1.14      2.66      4.16       4.74       4.92       5.84
# iterations       Trilinos      34        63        96        105        105        102
                   PETSc         33        68        94        105        105        102
1 iter. time [s]   Trilinos  4.48e-2   4.76e-2   5.00e-2    5.95e-2    9.81e-2    2.75e-1
                   PETSc     3.46e-2   3.92e-2   4.42e-2    4.52e-2    4.69e-2    5.73e-2
stopping criterion: ||r_k|| / ||r_0|| < 1e-5, without preconditioning
Process of integrating information from two (or more) different images
Images from different sensors, from different angles and/or at different times
Application to image registration
In medicine: monitoring of tumour growth, therapy evaluation, comparison of patient data with an anatomical atlas
Data from magnetic resonance (MR), computed tomography (CT), positron emission tomography (PET)
The task is to minimize the distance between two images
Elastic registration
$$\varphi(x) := x - u(x), \qquad T(\varphi(x)) \to R(x)$$
(the template image T deformed by φ should approach the reference image R)
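The slides do not spell out the registration functional; a common choice in elastic registration (an assumption here, consistent with the survey cited in the references) is a sum-of-squared-differences image distance regularized by the linear-elastic potential, whose discretization leads to exactly the kind of QP solved by TFETI above:

```latex
% Assumed model (not stated explicitly on the slides): SSD distance between the
% deformed template T and the reference R plus the linear-elastic potential of u;
% \mu, \lambda here are the Lame constants (not the Lagrange multipliers above).
\min_{u}\;
  \tfrac12 \int_{\Omega} \bigl( T(x - u(x)) - R(x) \bigr)^{2}\, dx
  \;+\;
  \int_{\Omega} \frac{\mu}{4} \sum_{j,k=1}^{2}
      \bigl( \partial_{x_j} u_k + \partial_{x_k} u_j \bigr)^{2}
    + \frac{\lambda}{2}\, (\operatorname{div} u)^{2}\, dx
```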
Parallelization using TFETI method
Elastic registration
# of subdomains 1 4 16
Primal variables 20402 81608 326432
Dual variables 903 2641 8254
Solution time [s] 41 34.54 57.44
# of iterations 2467 990 665
Time/iteration [s] 0.01 0.03 0.08
Results
stopping criterion: ||r_k|| / ||r_0|| < 1e-5
Solution
To consolidate the PETSc & Trilinos TFETI implementations into the form of extensions or packages
To further optimize the codes using core-hours on Tier-1/Tier-0 systems (PRACE DECI Initiative, HPC-Europa2)
To extend image registration to 3D data
Conclusion and future work
KOZUBEK, T. et al. Total FETI domain decomposition method and its massively parallel implementation. Accepted for publication in Advances in Engineering Software.
HORAK, D.; HAPLA, V. TFETI coarse space projectors parallelization strategies. Accepted for publication in the proceedings of PPAM 2011, Springer LNCS, 2012.
ZITOVA, B.; FLUSSER, J. Image registration methods: a survey. Image and Vision Computing, Vol. 21, No. 11, 2003, pp. 977–1000.
References
Thank you for your attention!