Massively parallel implementation of Total-FETI DDM with application to medical image registration
Massively parallel implementation of Total-FETI DDM with application to medical image registration
Michal Merta, Alena Vašatová, Václav Hapla, David Horák
DD21, Rennes, France
solution of large-scale scientific and engineering problems with possibly hundreds of millions of DOFs; both linear and non-linear problems
non-overlapping FETI methods with up to tens of thousands of subdomains
usage of PRACE Tier-1 and Tier-0 HPC systems
Motivation
developed by Argonne National Laboratory
data structures and routines for the scalable parallel solution of scientific applications modeled by PDEs
coded primarily in the C language, but with good Fortran support; can also be called from C++ and Python codes
current version is 3.2, www.mcs.anl.gov/petsc
petsc-dev (the development branch) is intensively evolving
code and mailing lists open to anybody
PETSc (Portable, Extensible Toolkit for Scientific Computation)
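As a minimal, hedged illustration (not from the slides; the 1D Laplacian is only a placeholder problem, and the code targets a recent PETSc API rather than version 3.2), a PETSc program that assembles a distributed matrix and solves it with the built-in CG might look like:

```cpp
// Illustrative sketch only, not the authors' code: assemble a distributed
// 1D Laplacian and solve A x = b with PETSc's KSP (conjugate gradients).
#include <petscksp.h>

int main(int argc, char **argv)
{
  PetscInitialize(&argc, &argv, NULL, NULL);

  const PetscInt n = 100;                 /* global size, placeholder problem */
  Mat A; Vec x, b; KSP ksp;
  PetscInt rstart, rend;

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetFromOptions(A);
  MatSetUp(A);
  MatGetOwnershipRange(A, &rstart, &rend);
  for (PetscInt i = rstart; i < rend; ++i) {   /* each rank assembles its own rows */
    if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
    if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
    MatSetValue(A, i, i, 2.0, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  VecCreate(PETSC_COMM_WORLD, &b);
  VecSetSizes(b, PETSC_DECIDE, n);
  VecSetFromOptions(b);
  VecDuplicate(b, &x);
  VecSet(b, 1.0);

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetType(ksp, KSPCG);                 /* built-in CG, as used in the benchmark */
  KSPSetTolerances(ksp, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT);
  KSPSetFromOptions(ksp);
  KSPSolve(ksp, b, x);

  KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
  PetscFinalize();
  return 0;
}
```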
PETSc components (diagram of the sequential and parallel components of the library)
developed by Sandia National Laboratories
collection of relatively independent packages
toolkit for basic linear algebra operations, direct and iterative solvers for linear systems, PDE discretization utilities, mesh generation tools, etc.
object-oriented design, high modularity, use of modern C++ features (templating)
written mainly in C++ (Fortran and Python bindings)
current version 10.10, trilinos.sandia.gov
Trilinos
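A comparable hedged sketch with Trilinos (illustrative only, not the authors' code; it uses the Epetra and AztecOO packages for brevity, whereas the benchmark below uses the Belos solver package):

```cpp
// Illustrative sketch only: assemble a distributed 1D Laplacian with Epetra
// and solve it with AztecOO's conjugate gradients.
#include <mpi.h>
#include <Epetra_MpiComm.h>
#include <Epetra_Map.h>
#include <Epetra_CrsMatrix.h>
#include <Epetra_Vector.h>
#include <Epetra_LinearProblem.h>
#include <AztecOO.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  Epetra_MpiComm comm(MPI_COMM_WORLD);

  const int n = 100;                         // global size, placeholder problem
  Epetra_Map map(n, 0, comm);                // row distribution over MPI ranks
  Epetra_CrsMatrix A(Copy, map, 3);          // at most 3 nonzeros per row

  for (int k = 0; k < map.NumMyElements(); ++k) {   // each rank fills its own rows
    int i = map.GID(k);
    double vals[3]; int cols[3]; int nnz = 0;
    if (i > 0)     { vals[nnz] = -1.0; cols[nnz++] = i - 1; }
    vals[nnz] = 2.0; cols[nnz++] = i;
    if (i < n - 1) { vals[nnz] = -1.0; cols[nnz++] = i + 1; }
    A.InsertGlobalValues(i, nnz, vals, cols);
  }
  A.FillComplete();

  Epetra_Vector x(map), b(map);
  b.PutScalar(1.0);

  Epetra_LinearProblem problem(&A, &x, &b);
  AztecOO solver(problem);
  solver.SetAztecOption(AZ_solver, AZ_cg);   // conjugate gradients
  solver.SetAztecOption(AZ_precond, AZ_none);
  solver.Iterate(1000, 1e-5);                // max iterations, relative tolerance

  MPI_Finalize();
  return 0;
}
```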
Trilinos components
are parallelized on the data level (vectors & matrices) using MPI
use BLAS and LAPACK, the de facto standard for dense LA
have their own implementations of sparse BLAS
include robust preconditioners, linear solvers (direct and iterative) and nonlinear solvers
can cooperate with many other external solvers and libraries (e.g. MATLAB, MUMPS, UMFPACK, …)
support CUDA and hybrid parallelization
are licensed as open-source
Both PETSc and Trilinos…
Problem of elastostatics
Ω ... isotropic elastic body
Γ_U ... boundary with prescribed displacements
Γ_F ... boundary with prescribed surface traction
f ... body loads
TFETI decomposition
Γ^{pq} ... artificial boundaries between subdomains Ω_p and Ω_q, with prescribed gluing conditions enforced by Lagrange multipliers
(figure: decomposition into subdomains with gluing boundaries Γ^{12}, Γ^{13}, Γ^{24}, Γ^{34})
The FEM discretization with a suitable numbering of nodes results in the QP problem:
Primal discretized formulation
$$\min_{u}\ \tfrac12\, u^{T} K u - f^{T} u \quad\text{s.t.}\quad B u = c$$
where $K = \mathrm{diag}(K_1, \dots, K_{N_S}) \in \mathbb{R}^{n\times n}$ is a symmetric positive semidefinite (and so, in general, singular) block-diagonal global stiffness matrix, $K_s$ is the stiffness matrix of the subdomain $\Omega_s$, $B \in \mathbb{R}^{m\times n}$ is a full-rank constraint matrix, $c \in \mathbb{R}^{m}$ is the constraint RHS, and $f \in \mathbb{R}^{n}$ is a load vector.
Dual discretized formulation (homogenized)
$$\min_{\lambda}\ \tfrac12\, \lambda^{T} A \lambda - \lambda^{T} b \quad\text{s.t.}\quad G\lambda = o$$
where
$$F = B K^{+} B^{T}, \qquad d = B K^{+} f - c, \qquad G = R^{T} B^{T}, \qquad e = R^{T} f,$$
$$Q = G^{T} (G G^{T})^{-1} G, \qquad P = I - Q, \qquad A = P F P, \qquad \tilde{\lambda} = G^{T} (G G^{T})^{-1} e, \qquad b = P\,(d - F\tilde{\lambda}),$$
$K^{+}$ is a generalized inverse of $K$ (i.e. $K K^{+} K = K$), $R$ is a full-rank matrix such that $\mathrm{Im}\,R = \mathrm{Ker}\,K$, $\mathrm{Im}\,Q = \mathrm{Im}\,G^{T}$, and $\mathrm{Im}\,P = \mathrm{Ker}\,G$.
QP problem again, but with lower dimension and simpler constraints
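The slides jump from the primal QP straight to its (homogenized) dual; the standard FETI dualization behind this step, sketched here only for readability, is:

```latex
% Sketch of the standard FETI dualization (not verbatim from the slides).
% Lagrangian of the primal QP: L(u,\lambda) = 1/2 u^T K u - f^T u + \lambda^T (B u - c).
\begin{aligned}
  \partial_u L = 0:\;& K u = f - B^{T}\lambda
      \;\;\Rightarrow\;\; u = K^{+}(f - B^{T}\lambda) + R\alpha,\\
  \text{solvability:}\;& R^{T}(f - B^{T}\lambda) = o
      \;\;\Longleftrightarrow\;\; G\lambda = e,
      \qquad G = R^{T}B^{T},\; e = R^{T}f,\\
  \text{dual QP:}\;& \min_{\lambda}\ \tfrac12\,\lambda^{T} F \lambda - \lambda^{T} d
      \quad\text{s.t.}\quad G\lambda = e,
      \qquad F = B K^{+} B^{T},\; d = B K^{+} f - c.
\end{aligned}
% Homogenization: with \tilde\lambda = G^T (G G^T)^{-1} e (so G\tilde\lambda = e) and
% \lambda = \tilde\lambda + \lambda_0, G\lambda_0 = o, one arrives at the projected
% form with A = PFP and b = P(d - F\tilde\lambda) shown on the previous slide.
```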
Primal data distribution, F action
… straightforward matrix distribution, given by the decomposition
$$F\lambda = B K^{+} B^{T} \lambda$$
B ... very sparse
K ... block diagonal → the F action is embarrassingly parallel
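A hedged sketch of one F application on top of PETSc objects (illustrative only, not the authors' code; B, kspK and the work vectors are placeholder names, and in the real implementation B and the factored K are distributed by subdomains):

```cpp
// Sketch of one application y = F*lambda = B K^+ B^T lambda (illustrative placeholder).
// Assumes: Mat B (very sparse), KSP kspK realizing K^+ (e.g. LU of a regularized K),
// work vectors u, v in the primal space. All names are hypothetical.
#include <petscksp.h>

PetscErrorCode ApplyF(Mat B, KSP kspK, Vec lambda, Vec u, Vec v, Vec y)
{
  MatMultTranspose(B, lambda, u);  /* u = B^T lambda                               */
  KSPSolve(kspK, u, v);            /* v = K^+ u  (block diagonal: embarrassingly parallel) */
  MatMult(B, v, y);                /* y = B v                                      */
  return 0;
}
```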
Coarse projector action
$$P = I - Q, \qquad Q = G^{T} (G G^{T})^{-1} G$$
… can easily take 85 % of the computation time if not properly parallelized!
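A hedged sketch of one projector application with PETSc objects (illustrative only; G, kspGGt and the work vectors are placeholder names, and the coarse problem with G G^T is assumed to be prepared once in the preprocessing phase):

```cpp
// Sketch of one application y = P*x = x - G^T (G G^T)^{-1} G x (illustrative placeholder).
// Assumes: Mat G, KSP kspGGt solving the coarse problem with G G^T,
// work vectors gx, alpha in the coarse space. All names are hypothetical.
#include <petscksp.h>

PetscErrorCode ApplyP(Mat G, KSP kspGGt, Vec x, Vec gx, Vec alpha, Vec y)
{
  MatMult(G, x, gx);                 /* gx    = G x                                */
  KSPSolve(kspGGt, gx, alpha);       /* alpha = (G G^T)^{-1} G x  -- coarse problem */
  MatMultTranspose(G, alpha, y);     /* y     = G^T alpha                          */
  VecAYPX(y, -1.0, x);               /* y     = x - y                              */
  return 0;
}
```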
G preprocessing and action
preprocessing
action
Coarse problem preprocessing and action
preprocessing
action
(figure: coarse problem parallelization variants 1, 2, 3)
Currently used variant: B2 (PPAM 2011)
Coarse problem
the UK's largest, fastest and most powerful supercomputer
supplied by Cray Inc., operated by EPCC
uses the latest AMD "Bulldozer" multicore processor architecture
704 compute blades, each blade with 4 compute nodes, giving a total of 2816 compute nodes
each node with two 16-core AMD Opteron 2.3 GHz Interlagos processors → 32 cores per node, a total of 90 112 cores
each 16-core processor shares 16 GB of memory, 60 TB in total
theoretical peak performance over 800 Tflops
HECToR phase 3 (XE6)
www.hector.ac.uk
K^+ implemented as a direct solve (LU) of the regularized K
built-in CG routine used (PETSc KSP, Trilinos Belos)
E = 1e6, ν = 0.3, g = 9.81 m·s⁻²; computed @ HECToR
Benchmark
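A hedged sketch of how this setup might look on the PETSc side (illustrative only, written against a recent PETSc API; Kreg and PFP are placeholder names for the regularized subdomain stiffness matrix and a shell matrix applying A = PFP):

```cpp
// Illustrative placeholder, not the authors' code: K^+ realized as an LU solve of
// the regularized stiffness matrix, built-in CG selected for the projected dual system.
#include <petscksp.h>

PetscErrorCode SetupAndSolve(Mat Kreg, Mat PFP, Vec b, Vec lambda)
{
  KSP kspK, kspDual;
  PC  pcK;

  /* K^+ : direct (LU) solve with the regularized stiffness matrix */
  KSPCreate(PETSC_COMM_SELF, &kspK);     /* each subdomain owns and factorizes its block */
  KSPSetOperators(kspK, Kreg, Kreg);
  KSPSetType(kspK, KSPPREONLY);          /* no iteration, just apply the factorization  */
  KSPGetPC(kspK, &pcK);
  PCSetType(pcK, PCLU);
  KSPSetUp(kspK);                        /* triggers the LU factorization               */
  /* kspK would be passed to the F action sketched earlier; destroyed here for brevity  */

  /* dual problem: built-in CG, relative tolerance 1e-5, no preconditioning */
  KSPCreate(PETSC_COMM_WORLD, &kspDual);
  KSPSetOperators(kspDual, PFP, PFP);
  KSPSetType(kspDual, KSPCG);
  KSPSetTolerances(kspDual, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT);
  KSPSolve(kspDual, b, lambda);

  KSPDestroy(&kspK);
  KSPDestroy(&kspDual);
  return 0;
}
```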
Results
# subds = # cores              1         4         16         64        256        1024
Primal dim.               31 752   127 008    508 032  2 032 128  8 128 512  32 514 048
Dual dim.                    252     1 512      7 056     30 240    124 992     508 032
Solution time [s]  Trilinos    1.39      3.01      4.80       6.25      10.31      28.05
                   PETSc       1.14      2.66      4.16       4.74       4.92       5.84
# iterations       Trilinos      34        63        96        105        105        102
                   PETSc         33        68        94        105        105        102
1 iter. time [s]   Trilinos  4.48e-2   4.76e-2   5.00e-2    5.95e-2    9.81e-2    2.75e-1
                   PETSc     3.46e-2   3.92e-2   4.42e-2    4.52e-2    4.69e-2    5.73e-2
stopping criterion: ||r_k|| / ||r_0|| < 1e-5, without preconditioning
Process of integrating information from two (or more) different images
Images from different sensors, from different angles and/or at different times
Application to image registration
In medicine: monitoring of tumour growth, therapy evaluation, comparison of patient data with an anatomical atlas
Data from magnetic resonance (MR), computed tomography (CT), positron emission tomography (PET)
The task is to minimize the distance between two images
Elastic registration
$$\varphi(x) := x - u(x), \qquad T(\varphi(x)) \to R(x)$$
(the template image T deformed by φ should approach the reference image R)
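The slides do not spell out the registration functional; a common choice in elastic registration (an assumption here, consistent with the survey cited in the references) is a sum-of-squared-differences image distance regularized by the linear-elastic potential, whose discretization leads to exactly the kind of QP solved by TFETI above:

```latex
% Assumed model (not stated explicitly on the slides): SSD distance between the
% deformed template T and the reference R plus the linear-elastic potential of u;
% \mu, \lambda here are the Lame constants (not the Lagrange multipliers above).
\min_{u}\;
  \tfrac12 \int_{\Omega} \bigl( T(x - u(x)) - R(x) \bigr)^{2}\, dx
  \;+\;
  \int_{\Omega} \frac{\mu}{4} \sum_{j,k=1}^{2}
      \bigl( \partial_{x_j} u_k + \partial_{x_k} u_j \bigr)^{2}
    + \frac{\lambda}{2}\, (\operatorname{div} u)^{2}\, dx
```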
Parallelization using TFETI method
Elastic registration
# of subdomains 1 4 16
Primal variables 20402 81608 326432
Dual variables 903 2641 8254
Solution time [s] 41 34.54 57.44
# of iterations 2467 990 665
Time/iteration [s] 0.01 0.03 0.08
Results
stopping criterion: ||r_k|| / ||r_0|| < 1e-5
Solution
To consolidate the PETSc & Trilinos TFETI implementations into the form of extensions or packages
To further optimize the codes using core-hours on Tier-1/Tier-0 systems (PRACE DECI Initiative, HPC-Europa2)
To extend image registration to 3D data
Conclusion and future work
KOZUBEK, T. et al. Total FETI domain decomposition method and its massively parallel implementation. Accepted for publication in Advances in Engineering Software.
HORAK, D.; HAPLA, V. TFETI coarse space projectors parallelization strategies. Accepted for publication in the proceedings of PPAM 2011, Springer LNCS, 2012.
ZITOVA, B.; FLUSSER, J. Image registration methods: a survey. Image and Vision Computing, Vol. 21, No. 11, 2003, pp. 977–1000.
References
Thank you for your attention!