
Large Scale Drop Impact Analysis of Mobile Phone Using ADVC on Blue Gene/L

Hiroshi Akiba, Tomonobu Ohyama, Yoshinori Shibata, Kiyoshi Yuyama, Yoshikazu Katai, Ryuichi Takeuchi, Takeshi Hoshino
Allied Engineering Corporation, Japan

Shinobu Yoshimura
University of Tokyo, Japan

Hirohisa Noguchi
Keio University, Japan

Manish Gupta, John A Gunnels, Vernon Austel
IBM Thomas J. Watson Research Center, USA

Yogish Sabharwal, Rahul Garg
IBM India Research Laboratory, India

Shoji Kato, Takashi Kawakami
Toshiba Corporation, Japan

Satoru Todokoro, Junko Ikeda
NIWS Co., Ltd., Japan

Abstract

Existing commercial finite element analysis (FEA) codes do not exhibit the performance necessary for large scale analysis on parallel computer systems. In this paper, we demonstrate the performance characteristics of a commercial parallel structural analysis code, ADVC, on Blue Gene/L (BG/L). The numerical algorithm of ADVC is described, tuned, and optimized on BG/L, and then a large scale drop impact analysis of a mobile phone is performed. The model of the mobile phone is a nearly-full assembly that includes inner structures. The model we have analyzed has 47 million nodal points and 142 million DOFs. This does not seem exceptionally large, but the dynamic impact analysis of a product model of this size, with the contact condition on the entire surface of the outer case, cannot be handled by other CAE systems. Our analysis is an unprecedented attempt in the electronics industry. The analysis of about 2.4 milliseconds took only half a day (12.1 hours). The floating point performance obtained was 538 GFLOPS on 4096 nodes of BG/L.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

SC2006 November 2006, Tampa, Florida, USA

0-7695-2700-0/06 $20.00 ©2006 IEEE


1. Introduction

Large scale analysis is emerging as a major and realistic tool for the design analysis of industrial products in fields such as automobiles, electronics, energy, and other areas of engineering. Advances in computer systems now allow much more ambitious analysis. The full scale analysis of a realistic model, rather than a local or simplified one, is desired in the present environment and is now a practical goal. The size of analyses will increase more than tenfold in the course of a few years. However, the existing, traditional commercial finite element analysis (FEA) codes do not seem to exhibit the requisite performance for large scale analysis on parallel computer systems.

The chief difficulty with the traditional FEA codes on parallel computer systems is that one must use an implicit method capable of achieving sufficient parallel performance ([1]). To parallelize the traditional implicit method, the domain decomposition (DD) method is one possible approach. A direct method for the stiffness equation is applied on each subdomain; the mismatch that then appears on the boundaries shared by two subdomains is canceled through the CG iteration. This calculation scheme is the standard framework of an iterative DD method. While it is suitable for parallel processing, it performs poorly in practice if no further techniques are incorporated. The procedure above is nothing more than a distribution of the direct method on the whole domain to the subdomains. In addition, the CG iteration adds overhead to the calculation. The Balancing Domain Decomposition (BDD) method ([2], [3]) is a pioneering work that brought practical parallel performance to the DD method by incorporating a rigid coarse motion into the iterative algorithm of the Neumann preconditioned DD method, eliminating the “floating” motion of the subdomains that appears in that method (see Section 3.5).

We briefly note here the Finite Element Tearing and Interconnecting (FETI) method ([4], [5], [6], [7], [1]). FETI is a refined iterative DD method like BDD, which uses Lagrange multipliers to impose continuity of the displacement field on the boundaries of the subdomains. The rigid body motion of the subdomains is considered and balanced in each iteration step, but the continuity of the displacement is satisfied only at convergence. Although FETI shows good parallel performance, it has been improved to solve fourth order differential equations, such as shell problems, more efficiently. These improvements are two-level FETI and its enhanced algorithm FETI-DP. The Gordon Bell Prize winners in SC2002¹, M. Bhardwaj et al. ([8]), use FETI-DP in Salinas and demonstrate good performance with this approach.

¹ See comments in Section 1 on the work of SC2004 Gordon Bell Prize winner M. F. Adams et al. ([9]).

Turning our attention back to the standard DD method, we note that there are two ways to improve upon it. One is to prepare an effective preconditioner such as BDD or FETI; the other is to reduce the load of the direct method. We have developed the CGCG method ([10]), incorporating these ideas into the algorithm, and implemented it in ADVC.

ADVC is a commercial structural analysis code based on the ADVENTURE system ([11], [12]). ADVENTURE was developed by the ADVENTURE Project (The Development of Computational Mechanics System of Large Scale Analysis and Design, “Research for the Future Program” of JSPS: Japan Society for the Promotion of Science), in collaboration with the University of Tokyo, Keio University, Kyushu University, Allied Engineering, and other organizations.

ADVC has been developed on the ADVENTURE framework, enhancing both the solver performance and the functionality to the level required for practical analysis. In order to improve the solver performance, we incorporated two techniques into its architecture: we gave up the direct method on the subdomains from an early stage of development, and we prepared a high performance preconditioner.

Using ADVC on BG/L, a drop impact analysis of the full assembly of an actual mobile phone has been performed. Virtually no simplification has been made to the model, which includes detailed inner structures. The model has as many as 142 million DOFs. While the size may not seem large, the drop impact analysis of a product model requiring the contact condition on the entire surface of the outer case could not be handled by the existing CAE technology.

The analysis described in this paper combines a detailed model, the parallel performance of the application, and a high performance parallel computer system, which together point toward the product design analysis and computational mechanics of the future.

In Section 2, a brief description of BG/L is given, and in Section 3, ADVC’s iterative algorithm, the CGCG method, is described. In preparation for this, the general algorithm of the DD method is explained in terms of the K-orthogonal direct sum decomposition and the direct method. In Section 3.3, a general description of preconditioning for the DD method, in the so-called additive Schwarz form, is given. In Sections 3.4 and 3.5, the BDD method is explained. We enhance and expand BDD to CGCG in Section 3.6. CGCG is much simpler than BDD or the standard form of the DD method. In this paper, we have given a long introduction to the CGCG method. In Sections 3.7 and 3.8, some indispensable techniques used for our analysis are briefly described. In Section 3.9, comments are given on the ADVC pre- and postprocessors, which process data as large as the solver handles. In Section 4, the performance of ADVC is described, comparing it with BDD and evaluating the parallel performance on BG/L. In Section 5, the major part of this paper, the analysis of the mobile phone model mentioned above is described.

2. Architecture of Blue Gene/L

The BG/L system ([13]) represents a new approach to building supercomputers, which has pushed the limits of massively parallel computing. The BG/L system design was driven, at every level, by a strong focus on power efficiency. It uses low power processors, which allows a large number of processors to be packed into a given volume. A single air-cooled rack of BG/L has 2048 processors with an aggregate peak performance of 5.7 TFLOPS. BG/L uses system-on-a-chip technology to integrate powerful torus and collective networks, and it uses a novel software architecture ([14]) to support high levels of scalability. A 64K node BG/L system, with a peak performance of 367 TFLOPS, was successfully deployed at Lawrence Livermore National Laboratory in September 2005.

2.1 BG/L Hardware

Each BG/L node ([13]) has two 32-bit embedded PowerPC (PPC) 440 processors, each with 32 KB L1 data and instruction caches. The BG/L nodes support prefetching in hardware, based on detection of sequential data access. The prefetch buffer for each processor holds 64 L1 cache lines and is referred to as the L2 cache. Each chip also has a 4 MB L3 cache built from embedded DRAM, and an integrated DDR memory controller. A single BG/L node supports 512 MB or 1 GB of memory. The PPC 440 processor does not support hardware cache coherence at the L1 level. There are, however, instructions to invalidate a cache line or flush the cache, which can be used to manage coherence in software.

BG/L employs a SIMD-like extension of the PPC floating-point unit, which we refer to as the double floating point unit or DFPU ([15]). The DFPU adds a secondary FPU to the primary FPU as a duplicate copy with its own register file. BG/L supports a comprehensive set of parallel instructions on double-precision floating-point data.

The BG/L ASIC supports five different networks: torus, collective, global interrupts, Ethernet, and JTAG. The main communication network for point-to-point messages is a three-dimensional torus. Each node contains six bi-directional links for direct connection with its nearest neighbors. The raw hardware bandwidth of each torus link is 2 bits/cycle (175 MB/s at 700 MHz) in each direction. The torus network provides both adaptive and deterministic minimal path routing in a deadlock-free manner. The collective network implements broadcasts and reductions with low latency. The global interrupts network supports a fast barrier operation, with a measured latency (at the MPI level) of 1.6 microseconds for a 64K node system. On BG/L, I/O is supported via special I/O nodes, which are architecturally identical to compute nodes but are attached to the Gbit/s Ethernet network that connects the BG/L core to external file servers and host systems. The booting, control, and monitoring of the BG/L system are done over the JTAG network.
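The figures quoted in this section can be cross-checked with simple arithmetic. The following Python sketch recomputes the torus link bandwidth and the peak performance numbers. The 700 MHz clock and the processor counts come from the text; the rate of 4 floating-point operations per cycle per processor (two fused multiply-adds issued through the DFPU) is an assumption that is not stated above.

```python
# Back-of-the-envelope check of the BG/L figures quoted in Section 2.
# Assumption (not stated in the text): each processor completes
# 4 floating-point operations per cycle (2 fused multiply-adds via the DFPU).

CLOCK_HZ = 700e6              # clock quoted together with the torus bandwidth
FLOPS_PER_CYCLE = 4           # assumed, see the note above
PROCESSORS_PER_NODE = 2

def torus_link_bandwidth_mb_s(bits_per_cycle=2, clock_hz=CLOCK_HZ):
    """Raw one-directional bandwidth of a torus link in MB/s."""
    return bits_per_cycle * clock_hz / 8 / 1e6

def peak_tflops(processors, clock_hz=CLOCK_HZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Aggregate peak performance in TFLOPS for a given processor count."""
    return processors * clock_hz * flops_per_cycle / 1e12

link_mb_s = torus_link_bandwidth_mb_s()
rack_peak = peak_tflops(2048)
system_peak = peak_tflops(64 * 1024 * PROCESSORS_PER_NODE)

print(f"torus link: {link_mb_s:.0f} MB/s per direction")
print(f"one rack (2048 processors): {rack_peak:.1f} TFLOPS")
print(f"64K nodes: {system_peak:.0f} TFLOPS")
```

Under the stated assumption, the three values agree with the 175 MB/s, 5.7 TFLOPS, and 367 TFLOPS quoted in the text.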

2.2 BG/L Software

The programming model supported on BG/L is single program multiple data (SPMD), with message passing supported via an implementation of the Message Passing Interface (MPI). A BG/L job can be submitted in one of two modes. In coprocessor mode (CPM), which is the default mode, a single application (MPI) process runs on each compute node: one of the processors of the compute node is used for computation, and the other is used for offloading part of the communication operations. In virtual node mode (VNM), two application processes run on each compute node, one on each of the two processors.

BG/L uses a hierarchical organization of software, described in further detail in ([13]). User applications run exclusively on compute nodes under the supervision of a simple, minimalist compute node kernel (CNK). The I/O nodes run a customized version of Linux. Many system calls (such as “read” and “write”) are not executed directly on the compute node; instead, they are function shipped through the collective network to the “parent” I/O node. The control system is implemented as a collection of processes running on an external computer, called the service node for the machine. All of the visible state of BG/L is maintained in a commercial database on the service node.

BG/L provides an operating environment with a very low level of computational noise (interference from operating system activity). It also supports low latency communication (latency to the nearest neighbor is about 3.3 microseconds, or 2350 processor cycles), with a low half-bandwidth point (half of the asymptotic bandwidth is achieved at a message size smaller than 1 KB for several MPI bandwidth tests, such as point-to-point sends to all nearest neighbors and the “alltoall” collective operation).

3. Architecture of CGCG Method of ADVC

3.1 Introduction

The CGCG (Coarse Grid based CG) method ([10]) has been developed for structural finite element analysis (FEA), especially in parallel environments. The CGCG method is a CG method with domain decomposition, preconditioned by a motion of the decomposed subdomains. It can also be viewed as a multi-grid algorithm that takes the subdomains as a coarse grid. The domain decomposition (DD) method has been advanced by leading researchers and state-of-the-art algorithms. However, those algorithms use the direct method on each subdomain, and the direct method is almost always computationally heavy. We instead apply the CG method on the entire space, giving up the direct method.

In order to clarify the difference between the standard DD method and the CGCG method, we first describe the standard DD method and the well-known Balancing Domain Decomposition (BDD) method ([2], [3]), and then move on to describe the CGCG method in Section 3.6. BDD is an enhanced version of the standard DD method, which takes the rigid body motion of the subdomains into account within the algorithm. Compared to BDD, the CGCG method is much simpler. The coarse grids of the two methods are similar, but they differ slightly because of the architectures of the two methods.

3.2 Standard Form of Iterative Domain Decomposition Method

Let $\Omega$ be the target domain. A boundary condition is imposed on $\Omega$. A mesh division for FEA is given to $\Omega$. According to the procedures of FEA, the given partial differential equation gives a linear equation of the form

$$K u = f \qquad (1)$$

that is, a stiffness equation, where $u$ and $f$ are the vectors of the discretized displacement field and the external force field, respectively. The stiffness matrix $K$ is a positive definite symmetric matrix under an appropriate geometric boundary condition. Let $V$ be the space of all the degrees of freedom (DOFs). The solution $u$ of (1) is an element of $V$. It should be noted that $K$ is not a linear form on $V$ but a bilinear form on $V$.

We decompose the domain $\Omega$ into the subdomains $\Omega_I$. $\Omega_I$ and $\Omega_J$ ($I \neq J$) are not allowed to overlap, but two neighboring subdomains share their common boundary. The set of the subdomains $\Omega_I$ covers the whole domain $\Omega$. Each $\Omega_I$ takes over the mesh division for $\Omega$ and the boundary conditions given to $\Omega$. The decomposed domains give an exclusive partition into the inner domains and the boundaries. Let $V^i$ and $V^s$ be the DOFs on all of the inner domains and all of the boundaries, respectively. $V^i$ includes the DOFs corresponding to the boundary of the whole domain. We call the boundary given by the decomposition, without the boundary of the whole domain, the “inner boundary.”

$V$ is a direct sum of the two spaces, that is, the inner domains and the inner boundary:

$$V = V^i \oplus V^s \qquad (2)$$

The equation is decomposed block-wise into the following form according to (2):

$$u = \begin{pmatrix} u_i \\ u_s \end{pmatrix} \qquad (3)$$

$$\begin{pmatrix} K_{ii} & K_{is} \\ K_{si} & K_{ss} \end{pmatrix} \begin{pmatrix} u_i \\ u_s \end{pmatrix} = \begin{pmatrix} f_i \\ f_s \end{pmatrix} \qquad (4)$$

We rewrite this equation into the next two equations:

$$\begin{pmatrix} K_{ii} & K_{is} \\ K_{si} & K_{ss} \end{pmatrix} \begin{pmatrix} u_i^{(i)} \\ 0 \end{pmatrix} = \begin{pmatrix} f_i \\ -f_{s(re)} \end{pmatrix} \qquad (5)$$

$$\begin{pmatrix} K_{ii} & K_{is} \\ K_{si} & K_{ss} \end{pmatrix} \begin{pmatrix} u_i^{(s)} \\ u_s \end{pmatrix} = \begin{pmatrix} 0 \\ f_s + f_{s(re)} \end{pmatrix} \qquad (6)$$

The solution $u$ of (1) is given by

$$u = u^{(i)} + u^{(s)}: \quad u^{(i)} = \begin{pmatrix} u_i^{(i)} \\ 0 \end{pmatrix}, \quad u^{(s)} = \begin{pmatrix} u_i^{(s)} \\ u_s \end{pmatrix} \qquad (7)$$

Equation (5) assumes the zero Dirichlet condition on the inner boundary. $f_{s(re)}$ is the reaction force caused by the displacement field on the subdomains. Equation (6) has the Neumann condition $f_s + f_{s(re)}$ on the inner boundary. The first equation of (6) corresponds to the Laplace equation, and in this sense, the solution $u^{(s)}$ is called a discrete harmonic function ([16]).

The first equation of (6) gives the relation between $u_i^{(s)}$ and $u_s$. $u^{(s)}$ (or $u_i^{(s)}$) can be viewed as a function of $u_s$:

$$K_{ii} u_i^{(s)} + K_{is} u_s = 0 \qquad (8)$$

In this sense, $u^{(s)}$ can be viewed as a discrete harmonic expansion of $u_s$.

Let $V^{(s)}$ be the space of the discrete harmonic expansion of the inner boundary space $V^s$. $u^{(i)}$ is given by (7). Let $V^{(i)}$ be the space of $u^{(i)}$. Equations (5) and (6) can be rewritten as follows:

$$K u^{(i)} = P^{(i)T} f \qquad (9)$$

$$K u^{(s)} = P^{(s)T} f \qquad (10)$$

where

$$u^{(i)} = P^{(i)} u, \quad u^{(s)} = P^{(s)} u \qquad (11)$$

$$P^{(i)} \equiv \begin{pmatrix} 1 & K_{ii}^{-1} K_{is} \\ 0 & 0 \end{pmatrix}, \quad P^{(s)} \equiv \begin{pmatrix} 0 & -K_{ii}^{-1} K_{is} \\ 0 & 1 \end{pmatrix} \qquad (12)$$

$P^{(i)}$ and $P^{(s)}$ are complementary K-orthogonal projection operators from $V$ to $V^{(i)}$ and $V^{(s)}$, respectively, which means:

$$\left(P^{(i)}\right)^2 = P^{(i)}, \quad \left(P^{(s)}\right)^2 = P^{(s)}, \quad P^{(i)} + P^{(s)} = 1 \qquad (13)$$

$$P^{(i)} P^{(s)} = P^{(s)} P^{(i)} = 0 \qquad (14)$$

$$P^{(i)T} K = K P^{(i)}, \quad P^{(s)T} K = K P^{(s)} \qquad (15)$$

$V$ is represented as a K-orthogonal direct sum:

$$V = V^{(i)} \oplus_K V^{(s)} \qquad (16)$$

$u^{(i)}$ can be naturally identified with $u_i^{(i)}$ by (7), so we finally have the following equations:

$$u = u^{(i)} + u^{(s)} \qquad (17)$$

$$u^{(i)} \in V^i : \quad K_{ii} u^{(i)} = f_i \qquad (18)$$

$$u^{(s)} = P^{(s)} u_s, \quad u_s \in V^s : \quad K P^{(s)} u_s = P^{(s)T} f \qquad (19)$$

Equation (18) corresponds to the DOFs of all the subdomains arranged block-diagonally. Therefore, equation (18) is essentially local. However, equation (19) is essentially global in the sense that $V^{(s)}$ consists of all the DOFs of the inner boundary of the whole domain.

$K P^{(s)}$ in (19) is written as

$$K P^{(s)} = \begin{pmatrix} 0 & 0 \\ 0 & K_{ss} - K_{si} K_{ii}^{-1} K_{is} \end{pmatrix} \qquad (20)$$

The term $K_{ss} - K_{si} K_{ii}^{-1} K_{is}$ is widely known as the Schur complement.

Equation (18) is solved by the direct method. Usually this calculation is parallelized over the subdomains. As to equation (19), the projective CG method is applied using the projection $P^{(s)}$.
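The K-orthogonality stated in (15) and (16) can be checked directly from the block form (20). The short derivation below is a sketch that uses only the symmetry of $K$ (so that $K_{si} = K_{is}^T$) and definition (12); it introduces no new quantities.

```latex
\begin{align*}
\bigl(K P^{(s)}\bigr)^{T} &= K P^{(s)}
  && \text{because the block } K_{ss} - K_{si} K_{ii}^{-1} K_{is} \text{ in (20) is symmetric,} \\
P^{(s)T} K &= \bigl(K P^{(s)}\bigr)^{T} = K P^{(s)}
  && \text{which is the second relation in (15),} \\
P^{(i)T} K P^{(s)} &=
  \begin{pmatrix} 1 & 0 \\ K_{si} K_{ii}^{-1} & 0 \end{pmatrix}
  \begin{pmatrix} 0 & 0 \\ 0 & K_{ss} - K_{si} K_{ii}^{-1} K_{is} \end{pmatrix} = 0
  && \text{so } V^{(i)} \text{ and } V^{(s)} \text{ are } K\text{-orthogonal, as in (16).}
\end{align*}
```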

Together, equations (19) and (20) show that the solving process of the local equation (18) is needed in each CG iteration step.

The method of iterative domain decomposition presented above is suitable for parallel processing, but is poor in practical calculation. The above procedure is nothing more than the distribution of the direct method from the whole domain to the subdomains. In addition, CG iteration for (19) gives rise to overhead on calculation. There might be some applicability to treating large scale problems, but it is not superior to high performance serial iteration methods for small problems. There are two ways to overcome this situation: one is to prepare a strong preconditioner; the other is to reduce the computational load of the direct method.
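The procedure of this section can be followed on a very small problem. The Python sketch below is an illustration only, not ADVC code: it assumes a hypothetical 1D model with five DOFs, in which DOFs 0, 1 and 3, 4 are interior to two subdomains and DOF 2 is the single inner-boundary DOF. It solves the local equation (18), forms the Schur complement of (20), solves the interface equation (19), and compares the result with a direct solve of (1). A dense Gaussian elimination stands in for both the subdomain direct method and the CG iteration on the interface.

```python
# Substructuring (Schur complement) solve of K u = f on a tiny 1D model.
# Hypothetical example: interior DOFs {0, 1} and {3, 4}, interface DOF {2}.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            m = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= m * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def sub(A, rows, cols):
    """Extract the block of A with the given row and column indices."""
    return [[A[r][c] for c in cols] for r in rows]

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

n = 5
K = [[2.0 if r == c else -1.0 if abs(r - c) == 1 else 0.0 for c in range(n)]
     for r in range(n)]
f = [1.0] * n
inner, iface = [0, 1, 3, 4], [2]

K_ii, K_is = sub(K, inner, inner), sub(K, inner, iface)
K_si, K_ss = sub(K, iface, inner), sub(K, iface, iface)
f_i, f_s = [f[k] for k in inner], [f[k] for k in iface]

# Equation (18): interior solve with zero Dirichlet data on the interface.
u_i_inner = solve(K_ii, f_i)

# Interface load f_s + f_s(re), with f_s(re) = -K_si u_i^(i) from (5).
g = [fs - r for fs, r in zip(f_s, matvec(K_si, u_i_inner))]

# Schur complement K_ss - K_si K_ii^{-1} K_is of (20), built column by column.
cols = [solve(K_ii, [row[j] for row in K_is]) for j in range(len(iface))]
S = [[K_ss[a][b] - sum(K_si[a][k] * cols[b][k] for k in range(len(inner)))
      for b in range(len(iface))] for a in range(len(iface))]
u_s = solve(S, g)   # the text applies CG here; a direct solve keeps the sketch short

# Discrete harmonic part from (8): K_ii u_i^(s) = -K_is u_s.
u_i_harm = solve(K_ii, [-v for v in matvec(K_is, u_s)])

u = [0.0] * n
for k, dof in enumerate(inner):
    u[dof] = u_i_inner[k] + u_i_harm[k]
for k, dof in enumerate(iface):
    u[dof] = u_s[k]

u_direct = solve(K, f)
print(u)
print(u_direct)
```

Every application of the Schur complement requires a solve with $K_{ii}$, which is the per-iteration local cost that the text identifies as the weakness of the standard scheme.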

3.3 Preconditioner for Iterative Domain Decomposition Method

$V_I$ denotes the admissible displacement field in the whole domain $\Omega$, defined on $\Omega_I$ and its boundary. Although the subdomains $\Omega_I$ do not overlap by definition, $V_I$ overlaps with the neighboring spaces that share a boundary with $\Omega_I$.

Let $\varphi_j$ be the finite element basis functions defined globally on $\Omega$; their non-zero restrictions $\varphi_j|_{\Omega_I}$ to $\Omega_I$ constitute the basis functions on $\Omega_I$. $\hat{V}_I$ denotes the admissible local displacement field induced by the restrictions $\varphi_j|_{\Omega_I}$. $\hat{V}_I$ does not coincide with $V_I$, nor is it even a subspace of $V$, because of the form of the restricted functions.

Since the two sets of basis functions of $\hat{V}_I$ and $V_I$ have a trivial one-to-one correspondence through the respective basis functions at each nodal point, a one-to-one linear mapping from $\hat{V}_I$ to $V_I$ can be defined. $N_I$ represents this linear mapping. $V_I$ is a subspace of $V$. Accordingly, $N_I$ can be viewed as an embedding of $\hat{V}_I$ into $V$. Its transpose $(N_I)^T$ is a restriction mapping from $V$ into $\hat{V}_I$. $(N_I)^T$ is a restriction operator, and $N_I$ is a prolongation operator.

We consider the following situation. Take a point $\omega$ in the whole domain $\Omega$, which corresponds to a DOF $u$ in the global space $V$. Apply the restriction and then the prolongation operator to $u$ successively. If $\omega$ is in a substructure $\Omega_I$ for some $I$ and not on any of the inner boundaries, $N_I (N_I)^T u$ returns to the initial DOF $u$ in $V$ through the local field $\hat{V}_I$.

However, if the point $\omega$ is on the inner boundaries of several subdomains $\Omega_I$ for a set of $I$, the corresponding DOF $u$ belongs to the subspaces $V_I$ for the set of $I$ in a duplicated manner. In order to cancel this duplication and to obtain $u$ in the global space $V$ after applying the corresponding restrictions and prolongations successively, we define a set of diagonal mappings (diagonal bilinear forms) $D_I$ for the set of $I$ that satisfy

$$1 = \sum_I N_I D_I (N_I)^T \qquad (21)$$

It should be noted that $D_I$ is not a linear form but a bilinear form, in the same way as $K$. The set of $D_I$ is a partition of unity. Although the diagonal elements can be defined in any way as long as equation (21) is satisfied, the simplest way is to define the entry corresponding to $\omega$ as

$$1 / (\text{number of subdomains that share the point } \omega) \qquad (22)$$

Each $D_I$ works as an averaging operation for gathering the DOFs of the duplicated points. Equation (21) includes the first case stated above, where $\omega$ is in $\Omega_I$ for some $I$ and not on any of the inner boundaries.
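The partition of unity (21) with the weights (22) can be checked numerically. The sketch below assumes a hypothetical layout of five global nodes split into two subdomains that share node 2; the function names are illustrative and do not come from ADVC. It applies the right-hand side of (21) to a vector and confirms that the vector is recovered, whereas restriction followed by prolongation without the weights counts the shared node twice.

```python
# Partition of unity D_I for two subdomains sharing one node (hypothetical layout).
# Global nodes 0..4; subdomain 0 holds nodes [0, 1, 2], subdomain 1 holds [2, 3, 4].

n_global = 5
subdomain_nodes = [[0, 1, 2], [2, 3, 4]]

# multiplicity[w] = number of subdomains that share global node w, as in (22)
multiplicity = [sum(w in nodes for nodes in subdomain_nodes) for w in range(n_global)]

def prolongation(nodes):
    """N_I: n_global x n_local 0/1 matrix embedding the local DOFs into V."""
    return [[1.0 if g == loc else 0.0 for loc in nodes] for g in range(n_global)]

def weights(nodes):
    """Diagonal of D_I: 1 / multiplicity for each local node."""
    return [1.0 / multiplicity[g] for g in nodes]

def restrict(N, u):
    """(N_I)^T u"""
    return [sum(N[g][l] * u[g] for g in range(n_global)) for l in range(len(N[0]))]

def prolong(N, v):
    """N_I v"""
    return [sum(N[g][l] * v[l] for l in range(len(v))) for g in range(n_global)]

def apply_sum(u):
    """Evaluate sum_I N_I D_I (N_I)^T u, the right-hand side of (21) applied to u."""
    total = [0.0] * n_global
    for nodes in subdomain_nodes:
        N, d = prolongation(nodes), weights(nodes)
        local = [di * vi for di, vi in zip(d, restrict(N, u))]
        total = [t + p for t, p in zip(total, prolong(N, local))]
    return total

u = [3.0, -1.0, 4.0, 1.0, -5.0]
recovered = apply_sum(u)

# Without D_I the shared node 2 is counted once per subdomain.
unweighted_shared = sum(prolong(prolongation(n), restrict(prolongation(n), u))[2]
                        for n in subdomain_nodes)
print(multiplicity, recovered, unweighted_shared)
```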

The global space $V$ is the sum of the subspaces $V_I$:

$$V = \sum_I V_I \qquad (23)$$

Since $V_I$ is obtained by the prolongation of the local space $\hat{V}_I$, and the duplication of the DOFs on the boundary can be cancelled by $D_I$, one representation of $V$ by the spaces $\hat{V}_I$ is:

$$V = \sum_I N_I D_I \hat{V}_I \qquad (24)$$

Next, we formulate an iterative DD method as a general preconditioned CG algorithm. Preconditioning is a method to obtain an equation equivalent to (1) with a smaller condition number by multiplying by a matrix $G$:

$$G K u = G f \qquad (25)$$

Since $K$ is a bilinear form on $V$, the same condition should be imposed on $G$. That is, $G$ is required to be positive definite symmetric.

We define an equilibrium equation on a local domain $\Omega_I$ that is obtained by restriction of that on the whole domain $\Omega$. Let the stiffness equation obtained through the FEA procedures on $\Omega_I$ be

$$K_I u_I = f_I \qquad (26)$$

which we call a local stiffness equation, in contrast with the global stiffness equation (1). According to the relationship between the global and the local basis functions, we have

$$K = \sum_I N_I K_I (N_I)^T \qquad (27)$$

$$f = \sum_I N_I f_I \qquad (28)$$

Therefore, the local DOF $u_I$ is represented by the global DOF $u$ through the restriction:

$$u_I = (N_I)^T u \qquad (29)$$

In general, since a subdomain does not necessarily have sufficient constraint boundary conditions, the local stiffness matrix $K_I$ is positive semi-definite but not necessarily regular. Equation (26) does not necessarily have a solution. If (26) has a solution, it is not necessarily unique. If the local equation (26) has a solution, then $u_I$ obtained by (29) is one of the solutions of (26). If $u_I$ is a solution of (26) for every $I$, then the global solution $u$ is equal to

$$u = \sum_I N_I D_I u_I \qquad (30)$$

by equation (21).
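Relations (27) and (29), and the statement that a local stiffness matrix need not be regular, can be illustrated on a two-element bar. The sketch below assumes unit-stiffness elements and applies no Dirichlet condition, so each subdomain (one element) floats; the names and the setup are hypothetical. The scatter-add in `assemble` is equivalent to $N_I K_I (N_I)^T$ when $N_I$ is a 0/1 matrix.

```python
# Assembling K = sum_I N_I K_I (N_I)^T, equation (27), for a two-element 1D bar.
# Hypothetical setup: unit-stiffness elements and no Dirichlet condition,
# so each local matrix K_I is singular (the subdomain "floats").

n_global = 3
subdomain_nodes = [[0, 1], [1, 2]]          # local-to-global node maps
K_local = [[1.0, -1.0], [-1.0, 1.0]]        # element stiffness, same for both

def assemble(local_matrices, node_maps, n):
    """Scatter-add the local matrices into the global matrix."""
    K = [[0.0] * n for _ in range(n)]
    for K_I, nodes in zip(local_matrices, node_maps):
        for a, ga in enumerate(nodes):
            for b, gb in enumerate(nodes):
                K[ga][gb] += K_I[a][b]
    return K

def restrict(u, nodes):
    """u_I = (N_I)^T u, equation (29)."""
    return [u[g] for g in nodes]

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

K = assemble([K_local, K_local], subdomain_nodes, n_global)

# A rigid translation of the subdomain produces no force: K_I is not regular.
rigid_mode = [1.0, 1.0]
local_force_from_rigid_mode = matvec(K_local, rigid_mode)

u = [0.0, 0.5, 2.0]
u_0 = restrict(u, subdomain_nodes[0])
print(K, local_force_from_rigid_mode, u_0)
```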

Since $K_I$ is positive semi-definite, we take $(K_I)^-$ as one of the generalized inverse matrices and let it be the local precondition matrix $\hat{G}_I$:

$$\hat{G}_I = (K_I)^- \qquad (31)$$

We define a prolongated local precondition matrix $G_I$ by a positive definite symmetric matrix:

$$G_I = N_I D_I \hat{G}_I D_I (N_I)^T \qquad (32)$$

The expression on the right hand side is a representation of the bilinear form $G_I$. Finally, we define the global precondition matrix by

$$G = \sum_I G_I \qquad (33)$$

This is the additive Schwarz form ([16]) and gives a preconditioned equation (25). In this way, the iterative DD method can be viewed as a preconditioned iterative method.
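The role of $G$ in (25) can be seen in a generic preconditioned CG loop, in which the preconditioner enters only through its action on the residual. The sketch below is a minimal illustration under stated assumptions: the 3x3 matrix is made up, and the Jacobi (diagonal) preconditioner is a stand-in chosen for brevity, not the additive Schwarz operator (33). Any positive definite symmetric $G$, including one built from (32) and (33), could be passed as `apply_G`.

```python
# Preconditioned CG for K u = f with a user-supplied preconditioner (r -> G r).
# The Jacobi preconditioner below is a stand-in chosen for brevity; it is NOT
# the additive Schwarz operator (33) described in the text.

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def pcg(K, f, apply_G, tol=1e-12, max_iter=100):
    """Return (u, iterations) once ||f - K u|| <= tol * ||f||."""
    u = [0.0] * len(f)
    r = f[:]                      # residual for the zero initial guess
    z = apply_G(r)                # preconditioned residual
    p = z[:]
    rz = dot(r, z)
    f_norm = dot(f, f) ** 0.5
    for it in range(1, max_iter + 1):
        Kp = matvec(K, p)
        alpha = rz / dot(p, Kp)
        u = [ui + alpha * pi for ui, pi in zip(u, p)]
        r = [ri - alpha * kpi for ri, kpi in zip(r, Kp)]
        if dot(r, r) ** 0.5 <= tol * f_norm:
            return u, it
        z = apply_G(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return u, max_iter

K = [[4.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 2.0]]
f = [1.0, 2.0, 3.0]
jacobi = lambda r: [ri / K[k][k] for k, ri in enumerate(r)]

u, iterations = pcg(K, f, jacobi)
residual = [fi - ki for fi, ki in zip(f, matvec(K, u))]
print(u, iterations, max(abs(x) for x in residual))
```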

3.4 Separation of DOFs in Subdomains

In order to construct a preconditioner in the iterative DD method concretely, we separate the DOFs in equation (26) associated with a subdomain $\Omega_I$ in the same way as stated in Section 3.2.

Let $V_I^s$ be the DOFs in $V_I$ on the inner boundary connected to the subdomain $\Omega_I$, and $V_I^i$ be the other DOFs in $V_I$. $V_I^i$ contains the DOFs on the boundary of the whole domain that is also on the inner boundary connected to $\Omega_I$, by the definition of $V_I^s$. Similarly, for the local space $\hat{V}_I$, let $\hat{V}_I^s$ be the DOFs in $\hat{V}_I$ on the inner boundary connected to the subdomain $\Omega_I$, and $\hat{V}_I^i$ be the other DOFs in $\hat{V}_I$. $\hat{V}_I^s$ does not coincide with $V_I^s$, but $\hat{V}_I^i$ is equal to $V_I^i$, by the construction of the basis functions defined on $\Omega_I$. Therefore, $\hat{V}_I$ can be decomposed as

$$\hat{V}_I = \hat{V}_I^i \oplus \hat{V}_I^s \qquad (34)$$

The local equation (26) is decomposed block-wise in the same way as (4):

$$\begin{pmatrix} K_I^{ii} & K_I^{is} \\ K_I^{si} & K_I^{ss} \end{pmatrix} \begin{pmatrix} u_I^i \\ u_I^s \end{pmatrix} = \begin{pmatrix} f_I^i \\ f_I^s \end{pmatrix} \qquad (35)$$

$K_I^{ii}$ is regular because $\hat{V}_I^i$ consists of the inner DOFs of the subdomain. The restriction $(N_I)^T$, the prolongation $N_I$, and the partition of unity are also decomposed block-wise into:

$$(N_I)^T = \begin{pmatrix} 1 & 0 \\ 0 & (N_I^s)^T \end{pmatrix}, \quad N_I = \begin{pmatrix} 1 & 0 \\ 0 & N_I^s \end{pmatrix}, \quad D_I = \begin{pmatrix} 1 & 0 \\ 0 & D_I^s \end{pmatrix} \qquad (36)$$

$N_I^s$ is a prolongation from $\hat{V}_I^s$ into $V_I^s$. $D_I^s$ is a diagonal matrix on $\hat{V}_I^s$ whose entries consist of the weights defined in the same way as (22), given through the construction of the prolongation.

From the relationship between the global and local stiffness matrices (27), we obtain

$$\begin{pmatrix} K_{ii} & K_{is} \\ K_{si} & K_{ss} \end{pmatrix} = \sum_I \begin{pmatrix} K_I^{ii} & K_I^{is} (N_I^s)^T \\ N_I^s K_I^{si} & N_I^s K_I^{ss} (N_I^s)^T \end{pmatrix} \qquad (37)$$

The term $N_I^s K_I^{ss} (N_I^s)^T$ is a summation of duplicated contributions only on the inner boundaries.

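We define complementary $K_I$-orthogonal projection operators on the local space $\hat{V}_I$ in the same way as equation (12):

$$P_I^{(i)} \equiv \begin{pmatrix} 1 & (K_I^{ii})^{-1} K_I^{is} \\ 0 & 0 \end{pmatrix}, \quad P_I^{(s)} \equiv \begin{pmatrix} 0 & -(K_I^{ii})^{-1} K_I^{is} \\ 0 & 1 \end{pmatrix} \qquad (38)$$

$\hat{V}_I$ is decomposed into the $K_I$-orthogonal direct sum of the space $\hat{V}_I^{(i)}$, which has zero displacement on the inner boundary, and the space $\hat{V}_I^{(s)}$ of the local discrete harmonic functions, which give zero reaction force in the subdomain:

$$\hat{V}_I = \hat{V}_I^{(i)} \oplus_{K_I} \hat{V}_I^{(s)} \qquad (39)$$

$$\hat{V}_I^{(i)} = P_I^{(i)} \hat{V}_I, \quad \hat{V}_I^{(s)} = P_I^{(s)} \hat{V}_I \qquad (40)$$

By definition, $\hat{V}_I^{(i)}$ is equal to $\hat{V}_I^i$.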

Furthermore, we decompose the local equation (26) into two equations for the DOFs associated with $\hat{V}_I^i$ and $\hat{V}_I^s$:

$$u_I = u_I^{(i)} + u_I^{(s)} \qquad (41)$$

$$u_I^{(i)} \in \hat{V}_I^i : \quad K_I^{ii} u_I^{(i)} = f_I^i \qquad (42)$$

$$u_I^{(s)} = P_I^{(s)} u_I^s, \quad u_I^s \in \hat{V}_I^s : \quad K_I P_I^{(s)} u_I^s = P_I^{(s)T} f_I \qquad (43)$$

In terms of equation (42), the space $V^{(i)}$ of the global inner DOFs is the direct sum of the spaces $\hat{V}_I^{(i)}$ of the local inner DOFs:

$$V^{(i)} = \bigoplus_I \hat{V}_I^{(i)} \qquad (44)$$

The equations (42) for all the subdomains are equivalent to the global equation (18); that is, equation (18) in Section 3.2 is obtained as the direct sum of the local equations (42).

In the formalization of the DD method in Section 3.2, the displacement $u_s$ on the inner boundary is obtained as an approximation in each iteration step. Accordingly, the subdomains are constrained in each step. However, in the process of constructing a precondition matrix (33) as described in Section 3.3, as long as each local domain is considered independently in the additive Schwarz form (33), the configuration of the subdomains is not necessarily determined uniquely. This indefinite rigid body displacement of subdomains is called a “floating” substructure ([2]). It is caused by the fact that the local stiffness matrix $K_I$ in equation (26) is not necessarily regular. A generalized inverse matrix $(K_I)^-$ of $K_I$, defined only for convenience of calculation, induces floating substructures.

Following the description above and Section 3.3, the natural construction of the preconditioner is widely known as the Neumann preconditioner. We take $V^{(i)}$ in the K-orthogonal decomposition (16) into the representation (23) of the global space $V$:

$$V = V^{(i)} + \sum_I P^{(s)} V_I \qquad (45)$$

We apply the additive Schwarz form (33) to this expression. We define the local operators for preconditioning:

$$G^{(i)} \equiv P^{(i)} \left(K^{ii}\right)^{-1} P^{(i)T} \qquad (46)$$

$$G_I \equiv P_I^{(i)} (K_I)^{-} P_I^{(i)T} + P_I^{(s)} (K_I)^{-} P_I^{(s)T} \qquad (47)$$

As to definition (47), the first term is canceled by the projection $P^{(s)}$. Accordingly, or defining directly,

$$G_I = P_I^{(s)} (K_I)^{-} P_I^{(s)T} \qquad (48)$$

Applying the prolongation and projection to $G_I$, we take $G_I$ as:

$$G_I = P^{(s)} N_I D_I\, G_I\, D_I^{T} N_I^{T} P^{(s)T} \qquad (49)$$

Substituting (46) and (49) into (33), we obtain a global precondition matrix:

$$G = G^{(i)} + \sum_I G_I \qquad (50)$$

The second term in equation (50) is called the Neumann precondition.

3.5 Balancing Domain Decomposition

Mandel ([2]) constructed a powerful preconditioner for the iterative DD method by eliminating the floating motion of the subdomains in the process of the Neumann preconditioning iteration. He named this algorithm the Balancing Domain Decomposition (BDD) method. The algorithm of BDD can be summarized as follows: the displacement in each subdomain is divided into the rigid body displacement, i.e. the floating motion, and the deformation that brings strain. The two motions are solved globally and independently, and the floating motion is extracted from the residual vector of the reaction force in the process of the CG iteration. Mandel called the elimination of the DOFs that induce the rigid body displacement from the reaction residual vector “balancing.”

Based on the discussions in Sections 3.2 to 3.4, the BDD algorithm can be stated clearly. Let $W^{s}$ be the space of DOFs of the rigid body displacement of the boundaries of the subdomains. Let $W^{(s)}$ be the space of discrete harmonic functions generated by $W^{s}$. $W^{s}$ or $W^{(s)}$ is called a coarse grid space.

In Section 3.2, the global space $V$ has been represented by the K-orthogonally decomposed direct sum of $V^{(i)}$ and $V^{(s)}$ as (16). Note that $W^{(s)}$ is a subspace of $V^{(s)}$. In the BDD algorithm, $V^{(s)}$ is decomposed further into $W^{(s)}$ and its K-orthogonal complementary space $V^{(sa)}$ in $V^{(s)}$:

$$V^{(s)} = W^{(s)} \oplus_K V^{(sa)} \qquad (51)$$

Along with this decomposition, the projection $P^{(s)}$ from $V$ onto $V^{(s)}$ is decomposed into two K-orthogonal complementary projections $P_W^{(s)}$ and $P^{(sa)}$ from $V$ onto $W^{(s)}$ and $V^{(sa)}$, respectively:

$$P^{(s)} = P_W^{(s)} + P^{(sa)} \qquad (52)$$

Here, both $P_W^{(s)}$ and $P^{(sa)}$ no longer have an explicit representation as $P^{(i)}$ or $P^{(s)}$ do, because the coarse motion cannot be represented by any explicit form. The procedures of operations can only be stated as shown below.

The global space $V$ is decomposed into three K-orthogonal subspaces:

$$V = V^{(i)} \oplus_K W^{(s)} \oplus_K V^{(sa)} \qquad (53)$$

This reflects the fact that the decomposition (53) separates the rigid body displacement of the inner boundaries from the global space, although the decomposition (16) only separates the displacement of the inner boundaries.

The local equation (19) is decomposed into the following equations:

$$u^{(s)} = u_W^{(s)} + u^{(sa)} \qquad (54)$$

$$u_W^{(s)} \in W^{(s)} : \; K u_W^{(s)} = P_W^{(s)T} f \qquad (55)$$

$$u^{(sa)} = P^{(sa)} u^{s}, \quad u^{s} \in V^{s} : \; K P^{(sa)} u^{s} = P^{(sa)T} f \qquad (56)$$

Equation (55) is called a coarse grid problem. In general, since the coarse grid problem is small, it is solved by the direct method in the same way as in the multi-grid algorithm. As to equation (56), the projective CG method is applied using the projection $P^{(sa)}$, which will be discussed later.

The coarse grid problem is solved in the following manner. The coarse grid defines six basic motions which consist of three translational motions and three rotational motions. First, we construct $W^{s}$ through defining $W_I^{s}$. Let $X_\alpha^{I}$ be the coordinate of the node $\alpha$ on the boundary of the subdomain $I$. Let $x_\alpha^{I}$ be the coordinate after infinitesimal rigid body displacement:

$$x_\alpha^{I} = P_j v_j + e^{O_j \theta_j} X_\alpha^{I} \cong P_j v_j + (1 + O_j \theta_j) X_\alpha^{I} \qquad (57)$$

Here we assumed the summation rule as to $j$, where

$$P_1 \equiv \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad P_2 \equiv \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad P_3 \equiv \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} \qquad (58)$$

$$O_1 \equiv \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{pmatrix}, \quad O_2 \equiv \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ -1 & 0 & 0 \end{pmatrix}, \quad O_3 \equiv \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \qquad (59)$$

$e^{O_j \theta_j}$ is the exponential function of the matrix $O_j \theta_j$, and represents the rotation with the rotation axis vector $\theta \equiv (\theta_1 \; \theta_2 \; \theta_3)^{T}$, rotation direction $\theta / |\theta|$ and magnitude $|\theta|$. The nodal displacement is obtained by equation (57):

$$u_\alpha^{I} \cong P_j v_j + O_j \theta_j X_\alpha^{I} = v + \theta \times X_\alpha^{I} \qquad (60)$$

The displacement fields $e_j^{I W^{s}}$ on the boundary of the subdomain $I$ defined by

$$e_j^{I W^{s}} : \; e_j^{I W^{s}}(X_\alpha^{I}) = P_{j+1}, \qquad e_{j+3}^{I W^{s}} : \; e_{j+3}^{I W^{s}}(X_\alpha^{I}) = O_{j+1} X_\alpha^{I}, \qquad j = 0, 1, 2 \qquad (61)$$

constitute the basis functions of the space $W_I^{s}$, whose degree is six. In case the degree of the functions $e_j^{I W^{s}}$ is less than six, depending on the configuration on the whole domain or the boundary conditions given to the subdomain $\Omega_I$, appropriate extension of the basis functions should be made. $W_I^{s}$ is a subspace of $V_I^{s}$. Applying the prolongation $N_I$ with the partition of unity $D_I$, we embed the space $W_I^{s}$ into the global space $V$.

$$W^{s} \equiv \sum_I N_I D_I W_I^{s}, \quad e_{Ij}^{W^{s}} \equiv N_I D_I e_j^{I W^{s}} \qquad (62)$$

The functions $e_{Ij}^{W^{s}},\ 0 \le j < m_I$ are linearly independent and constitute a basis of $W^{s}$. $W^{s}$ is a subspace of $V^{s}$. Furthermore, applying the projection $P^{(s)}$ to $e_{Ij}^{W^{s}},\ 0 \le j < m_I$, we obtain the discrete harmonic expansion:

$$e_{Ij}^{W^{(s)}} \equiv P^{(s)} e_{Ij}^{W^{s}} \qquad (63)$$

These functions $e_{Ij}^{W^{(s)}},\ 0 \le j < m_I$ define the space $W^{(s)}$ of the rigid body displacement as its basis.
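The six basic motions of equations (57) to (61) can be tabulated directly from nodal coordinates. The following is a minimal sketch in Python with NumPy, not the ADVC implementation: the function name `rigid_body_basis`, the node-major DOF ordering, and the four sample nodes are assumptions for illustration. The generators `O` are the standard infinitesimal rotation matrices, chosen so that $O_j \theta_j X = \theta \times X$ as in (60).

```python
import numpy as np

# Generators of infinitesimal rotations: (O_j theta_j) X = theta x X, cf. (60).
O = np.array([
    [[0, 0, 0], [0, 0, -1], [0, 1, 0]],
    [[0, 0, 1], [0, 0, 0], [-1, 0, 0]],
    [[0, -1, 0], [1, 0, 0], [0, 0, 0]],
], dtype=float)
P = np.eye(3)   # unit translation vectors P_1, P_2, P_3, cf. (58)

def rigid_body_basis(X):
    """Return a (3n, 6) matrix whose columns are the six rigid body modes
    evaluated at the n nodes in X (shape (n, 3)), DOFs ordered node by node."""
    n = X.shape[0]
    E = np.zeros((3 * n, 6))
    for j in range(3):
        E[:, j] = np.tile(P[j], n)           # translation along axis j
        E[:, j + 3] = (X @ O[j].T).ravel()   # rotation about axis j: O_j X_alpha
    return E

# Hypothetical boundary nodes of one subdomain.
X = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
E = rigid_body_basis(X)

v = np.array([0.1, -0.2, 0.3])        # translation
theta = np.array([0.01, 0.02, -0.03]) # small rotation vector
u = (E @ np.concatenate([v, theta])).reshape(-1, 3)
```

For these nodes the combination of the six columns reproduces `v + np.cross(theta, X)`, and `E` has full column rank, which corresponds to the case of degree six mentioned above.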

Now, we go back to the coarse grid problem (55). We take the solution of (55) as:

$$u_W^{(s)} = \sum_{I,j} \mu_{Ij}\, e_{Ij}^{W^{(s)}} \qquad (64)$$

where $\mu_{Ij}$ are the solution of the equation:

$$\sum_{J,k} K_{Ij,Jk}^{W^{s}}\, \mu_{Jk} = f_{Ij}^{W^{s}} \qquad (65)$$

The coefficient matrix and the right-hand side vector of this equation are given by the equations:

$$K_{Ij,Jk}^{W^{s}} \equiv \left(e_{Ij}^{W^{(s)}}\right)^{T} K\, e_{Jk}^{W^{(s)}}, \quad f_{Ij}^{W^{s}} \equiv \left(e_{Ij}^{W^{(s)}}\right)^{T} f \qquad (66)$$

In the process of the iteration for equation (56), we need to apply the projection $P^{(sa)}$ to a vector $u^{s}$ in $V^{s}$. As mentioned above, $P^{(sa)}$ cannot be represented explicitly, because $P^{(sa)}$ is the complementary projection operator of $P_W^{(s)}$, and $P_W^{(s)}$ is determined only by the procedures given above. Instead, we use equation (52), which states the complementarity itself. In equation (56), we take an approximation $u^{s}$. Then we have

$$u^{(s)} = P^{(s)} u^{s} \qquad (67)$$

and its projection $P_W^{(s)} u^{(s)}$ to $W^{(s)}$ is obtained through the coarse grid by determining $u_W^{(s)}$ in (55), which yields

$$P^{(sa)} u^{s} = u^{(s)} - P_W^{(s)} u^{(s)} \qquad (68)$$

Finally, we note the BDD algorithm in the context of the preconditioner given in additive Schwarz form as stated in Section 3.3. In the decomposition (53), we apply the Neumann precondition to the space $V^{(sa)}$. The precondition matrix is determined as:

$$G = P^{(i)} K^{-1} P^{(i)T} + P_W^{(s)} K^{-1} P_W^{(s)T} + P^{(sa)} G^{N} P^{(sa)T} \qquad (69)$$

where $G^{N}$ is the Neumann precondition matrix defined in (49), described as

$$G^{N} = \sum_I N_I D_I P_I^{(s)} (K_I)^{-} P_I^{(s)T} D_I^{T} N_I^{T} = \begin{pmatrix} 0 & 0 \\ 0 & \sum_I N_I^{s} D_I^{s} (S_I)^{-} D_I^{sT} N_I^{sT} \end{pmatrix} \qquad (70)$$

where $S_I$ is the Schur complement of the local stiffness matrix $K_I$. The first two terms of the precondition matrix (69) represent that the components of $V^{(i)}$ and $W^{(s)}$ are each solved strictly, and the Neumann preconditioned CG method is applied in the space $V^{(sa)}$, in the K-orthogonal decomposition (53).

The essence of the BDD algorithm is that the direct method is applied in $V^{(i)}$ and $W^{(s)}$, and the Neumann preconditioned CG iteration is applied to $V^{(sa)}$. The floating motion of the subdomains no longer occurs, since the floating displacement is excluded in terms of the coarse grid motion in the coarse grid $W^{(s)}$.
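The coarse grid solve (64) to (66) and the complementary projection (68) can be sketched on a small model problem. The following Python (NumPy) fragment is an illustration under simplifying assumptions, not the ADVC or BDD implementation: the 8-DOF tridiagonal matrix, the piecewise-constant coarse basis `E`, and the function names are hypothetical, and the projections act on the whole space rather than on the discrete harmonic space.

```python
import numpy as np

n = 8
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # SPD model matrix

# Hypothetical coarse basis: one column per subdomain (piecewise constants).
E = np.zeros((n, 2))
E[:4, 0] = 1.0
E[4:, 1] = 1.0

K_W = E.T @ K @ E   # coarse matrix, cf. (66)

def project_coarse(u):
    """K-orthogonal projection of u onto span(E) via a small direct solve."""
    mu = np.linalg.solve(K_W, E.T @ (K @ u))   # cf. (65) with f = K u
    return E @ mu                              # cf. (64)

def project_complement(u):
    """Complementary projection, cf. (68): u minus its coarse component."""
    return u - project_coarse(u)

u = np.arange(1.0, n + 1.0)
u_W = project_coarse(u)
u_a = project_complement(u)
```

The complement `u_a` is K-orthogonal to every coarse basis vector, so the projection never needs an explicit matrix representation; only the small coarse system is solved.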

3.6 CGCG Method

We have finally arrived at the point where we can describe the CGCG method. In BDD or the Neumann preconditioned method, the calculation load of the direct method on the local domains and of the K-orthogonal projection of (16) is heavy. The CGCG method removes the complicated K-orthogonal projection procedure and its calculation load. On the one hand, this makes the whole algorithm much simpler; on the other hand, it inherits the iteration performance of BDD.

We give up the K-orthogonal projection (16). Instead, we K-orthogonally decompose the global space $V$ directly into the space $W$ of the coarse grid and its K-orthogonal complementary subspace $V^{(a)}$:

$$V = W \oplus_K V^{(a)} \qquad (71)$$

$W$ is constructed through superposing the prolongations of the space $W_I$ of the rigid body displacement of the local subdomain $\Omega_I$:

$$W \equiv \sum_I N_I D_I W_I \qquad (72)$$

The basis functions $e_j^{IW}$ of the space $W_I$ can be obtained by using nearly the same procedures described in Section 3.5. In constructing the coarse grid of BDD, the coordinates span the boundary of the subdomain through equations (57) to (63); here in CGCG, the coordinates span the entire subdomain including the boundary. The basis functions $e_{Ij}^{W},\ 0 \le j < m_I$ on $W$ are given by:

$$e_{Ij}^{W} \equiv N_I D_I e_j^{IW} \qquad (73)$$

Equation (1) is decomposed into the following equations:

$$u = u^{(W)} + u^{(a)} \qquad (74)$$

$$u^{(W)} \in W : \; K u^{(W)} = P^{(W)T} f \qquad (75)$$

$$u^{(a)} \in V^{(a)} : \; K u^{(a)} = P^{(a)T} f \qquad (76)$$

where $P^{(W)}$, $P^{(a)}$ are K-orthogonal projections on the global space $V$ into $W$ and $V^{(a)}$, respectively. $P^{(a)}$ cannot be represented explicitly because $P^{(a)}$ is the complementary projection operator of $P^{(W)}$, and $P^{(W)}$ is determined only by the procedures constructing the coarse grid. These circumstances are the same as in the BDD algorithm.

The CGCG algorithm in the context of the precondition given in additive Schwarz form stated in Section 3.3 is written as:

$$G = P^{(W)} K^{-1} P^{(W)T} + P^{(a)} D_K^{-1} P^{(a)T} \qquad (77)$$

where $D_K$ is the diagonal matrix of $K$. The first term of this precondition matrix represents that the component of the coarse space $W$ is solved strictly, and then the $D_K$-based precondition is applied in the space $V^{(a)}$.

As seen above, the algorithm of the CGCG method is much simpler. No procedures taking the inner boundary apart from the inside of the subdomain are included, and for this reason, the discrete harmonic expansion is not considered. This allows for a large reduction in calculation cost.

Therefore, the coarse grid spaces of CGCG and BDD are a little different. The coarse grid generated in CGCG is rather rough and artificial around the inner boundary.
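A preconditioner of the form (77), an exact coarse solve plus diagonal scaling on the complement, can be sketched as follows. This Python (NumPy) example is a toy illustration and not the ADVC code: the 1D model matrix, the non-overlapping piecewise-constant coarse basis (so that the partition of unity is trivial), the dense coarse solve, and the names `pcg`, `coarse` and `apply_cgcg` are all assumptions. It uses the identity $P^{(W)} K^{-1} P^{(W)T} = E (E^{T} K E)^{-1} E^{T}$ for a coarse basis matrix $E$.

```python
import numpy as np

def pcg(K, f, apply_G, tol=1e-10, maxiter=1000):
    """Preconditioned CG for SPD K; apply_G(r) applies the preconditioner."""
    x = np.zeros_like(f)
    r = f - K @ x
    z = apply_G(r)
    p = z.copy()
    rz = r @ z
    for it in range(1, maxiter + 1):
        Kp = K @ p
        alpha = rz / (p @ Kp)
        x += alpha * p
        r -= alpha * Kp
        if np.linalg.norm(r) <= tol * np.linalg.norm(f):
            return x, it
        z = apply_G(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxiter

n, n_sub = 64, 8
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
f = np.ones(n)

# Hypothetical coarse basis: one constant function per subdomain.
size = n // n_sub
E = np.zeros((n, n_sub))
for I in range(n_sub):
    E[I * size:(I + 1) * size, I] = 1.0
K_c = E.T @ K @ E
d_inv = 1.0 / np.diag(K)

def coarse(r):
    """First term of (77): E (E^T K E)^{-1} E^T r."""
    return E @ np.linalg.solve(K_c, E.T @ r)

def apply_cgcg(r):
    """Coarse solve plus diagonal scaling on the K-orthogonal complement."""
    w = coarse(r)
    r_a = r - K @ w                  # P^(a)T r
    z = d_inv * r_a                  # D_K^{-1} P^(a)T r
    return w + z - coarse(K @ z)     # coarse part + P^(a) z

x_cgcg, it_cgcg = pcg(K, f, apply_cgcg)
x_diag, it_diag = pcg(K, f, lambda r: d_inv * r)   # diagonal scaling only
```

Comparing `it_cgcg` with `it_diag` gives a rough feel for what the coarse grid contributes on this toy problem; the actual gain in ADVC depends on the model and is reported in Section 4.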

Fig.1 shows examples of coarse motion on a plate with a tetrahedral mesh partition. The plate shown here is a part of a wider plate; we confine our description to the decomposition shown in Fig.1. The plate is decomposed into nine subdomains by four inner boundary surfaces. The inner boundaries of the subdomains are indicated by the arrows in the figure. As stated in Section 3.5, the coarse motion consists of the six basic motions. The figures here represent the translational motion perpendicular to the plate. In figures (a) and (b), the color contour shows the distribution of the subdomains, whereas in (c) and (d), it shows the distribution of the displacement. The coarse motion described here is restricted to these nine subdomains (nine plates) and does not affect the outer area, which is characteristic of local motion. These figures show that the coarse motion of BDD affects and distorts the neighboring subdomains; this distortion is generated by the discrete harmonic expansion of the DOFs on the inner boundary. In contrast, the coarse motion defined by the CGCG method is somewhat artificial, being given only by the partition of unity, or the average operation. In this example, the average operation in the CGCG method is given simply by 1/2, in accordance with (22).

Fig.1 Comparison of the coarse motion of CGCG and BDD method: (a) Boundary of CGCG; (b) Boundary of BDD; (c) Displacement norm of CGCG; (d) Displacement norm of BDD. Arrows indicate the inner boundary.

3.7 Multi-Point Constraint

In practical structural analysis, the analysis objects are in most cases composed of a large number of components. For electronics products in general, the number of component bodies in the assembly model ranges from several thousand to several tens of thousands. Components are fixed to one another by screws, press fitting, clamping, gluing, etc. In some cases, slip-sliding is allowed.

In structural analysis, there are several ways of tying two bodies: generating the mesh with shared nodal points, giving constraints directly between the bodies (direct tying), or simulating screws or bolts by a beam element or other means. Boundary conditions are given in whatever form suits the circumstances. Except for shared nodal points, all these requirements can be taken into account by the multi-point constraint technique, in which constraints are given to the related DOFs. The multi-point constraint technique is also used for contact analysis. The algorithm of the CGCG method should be expanded in that case.

Depending on the complexity of the model or the mesh, the number of DOFs associated with multi-point constraints amounts to several tens of percent of the total DOFs in some cases. If the total is several tens of millions of DOFs, then the DOFs associated with the multi-point constraints can number several million. Therefore, an efficient algorithm for multi-point constraints is needed.

The multi-point constraint is given by an indefinite equation:

$$Bu = 0 \qquad (78)$$

where $B$ is a horizontally long matrix. This equation defines a subspace $R$ of the global space $V$. The space $R$ is given by the projection:

$$P = 1 - B'\left(BB'\right)^{-1}B \qquad (79)$$

where $B'$ is some matrix that can be defined corresponding to the choice for the basis representing the projective space (see for example [17]).

In taking the multi-point constraint space into the CGCG algorithm, we do not apply the CGCG method directly to $V$ but to the projective space $R$. In terms of the CG algorithm, it has a projective CG formalization. Since dealing with the coarse grid can also take the form of a projective CG, the whole algorithm has a doubly projective form. The coarse grid is constructed on the space $R$.
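The projection (79) can be checked numerically on a small constraint set. In the sketch below (Python with NumPy), the two constraints and the choice $B' = B^{T}$ are assumptions made for illustration; the paper leaves $B'$ open, and $B^{T}$ is simply the choice that gives the orthogonal projection onto the null space of $B$.

```python
import numpy as np

# Two hypothetical constraints on five DOFs: u0 - u1 = 0 and u2 - 2*u3 = 0.
B = np.array([[1.0, -1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, -2.0, 0.0]])

B_prime = B.T   # assumed choice of B'
P = np.eye(B.shape[1]) - B_prime @ np.linalg.solve(B @ B_prime, B)

u = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
u_R = P @ u     # component of u that satisfies B u = 0
```

Any vector of the form `P @ u` satisfies the constraints, which is why an iterative method whose search directions are all passed through `P` stays inside the constrained space $R$.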


3.8 Non-linear and Dynamic Analysis

Here, we confine ourselves to a brief description of the analysis functions, because the functions used are classical and have little to do with parallel performance, although their implementation is complicated. Geometric and material non-linear analyses are considered in ADVC on the basis of standard incremental analysis. Dynamic analysis is also considered in a standard way. Dynamic motion is represented by the equation

$$M\ddot{u} + C\dot{u} + Ku = f \qquad (80)$$

This equation is discretized along the time axis, and it is represented in the same form as the static stiffness equation (1). Dynamic analysis of ADVC can consider both geometric and material non-linearity. The point is that we do not use an explicit method; instead, we apply the CGCG method even in dynamic analysis, which brings stable calculation in the sense that it does not depend largely on $\Delta t$ and the size of the mesh. This allows larger scale analysis compared to dynamic analysis using an explicit method.

In the next section, impact analysis, which is given by the contact and dynamic analyses, is performed.
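How an implicit time discretization turns (80) into a problem of the same form as the static equation can be sketched with the Newmark method. The paper does not name the time integration scheme used in ADVC, so the Newmark average-acceleration rule ($\beta = 1/4$, $\gamma = 1/2$), the function name `newmark_step`, and the two-DOF test system below are assumptions for illustration only. Each step solves `K_eff u_next = f_eff`, and the dense solve stands in for the iterative solver.

```python
import numpy as np

def newmark_step(M, C, K, f_next, u, v, a, dt, beta=0.25, gamma=0.5):
    """Advance (u, v, a) by one step; u_next solves K_eff u_next = f_eff."""
    a0 = 1.0 / (beta * dt**2)
    a1 = gamma / (beta * dt)
    a2 = 1.0 / (beta * dt)
    a3 = 1.0 / (2.0 * beta) - 1.0
    a4 = gamma / beta - 1.0
    a5 = 0.5 * dt * (gamma / beta - 2.0)

    K_eff = K + a0 * M + a1 * C   # effective stiffness, static-like form
    f_eff = (f_next + M @ (a0 * u + a2 * v + a3 * a)
             + C @ (a1 * u + a4 * v + a5 * a))
    u_next = np.linalg.solve(K_eff, f_eff)   # stand-in for the iterative solver
    a_next = a0 * (u_next - u) - a2 * v - a3 * a
    v_next = v + dt * ((1.0 - gamma) * a + gamma * a_next)
    return u_next, v_next, a_next

# Hypothetical undamped two-DOF system in free vibration.
M = np.eye(2)
C = np.zeros((2, 2))
K = np.array([[2.0, -1.0], [-1.0, 2.0]])
f = np.zeros(2)

u = np.array([1.0, 0.0])
v = np.zeros(2)
a = np.linalg.solve(M, f - C @ v - K @ u)   # consistent initial acceleration

def energy(u, v):
    return 0.5 * v @ M @ v + 0.5 * u @ K @ u

e0 = energy(u, v)
for _ in range(200):
    u, v, a = newmark_step(M, C, K, f, u, v, a, dt=0.1)
e_end = energy(u, v)
```

With these parameters the scheme conserves the energy of a linear undamped system regardless of the step size, which illustrates the kind of stability with respect to $\Delta t$ that an implicit method offers.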

3.9 ADVC Preprocessor and Postprocessor

For domain decomposition, ADVC uses the high performance graph partitioning tools Metis and ParMetis ([18]), both of which are used in various systems.

High performance solvers need a preprocessor and postprocessor that can handle large scale analysis. The ADVC preprocessor consists of a CAD interface importing native files of major CAD systems, a high performance mesh generator, and a graphic user interface (GUI) that attaches boundary conditions to the mesh, generating FEA data for the ADVC solver. The ADVC postprocessor visualizes the result obtained by the solver. Both the preprocessor and postprocessor run on Windows and can handle data as large as the solver processes. The analysis models in this paper have been constructed by the ADVC preprocessor, and the visualized analysis results have been obtained by the ADVC postprocessor.

4 Performance of ADVC

4.1 Comparison with BDD

We have presented our description of the CGCG method in contrast to and along with the BDD method. In this section we compare the performances of both methods. BDD is implemented in the ADVENTURE system ([1], [19]) tracing the original algorithm. ADVC originated in ADVENTURE (See Section 1), but the framework has largely been changed. Therefore, the two codes cannot be compared directly, but the difference in performance between BDD and CGCG can be seen in general terms by comparing the results of the two codes.

Two models are used. One is a two million DOFs machine component (a wheel of an automobile); the other is a six million DOFs machine component (an engine block). Both meshes consist of quadratic tetrahedral elements.

Both CGCG and BDD have the two-layer architecture in domain decomposition. That is, the whole domain is decomposed into parental level domains, and they are decomposed into child level subdomains. We assign the computer nodes (CPUs) to the parental domain, and then define an appropriate number of the subdomains in the parental domains. For this reason, we have performed preparatory analyses by giving the number of the subdomains, and we have chosen the best results from both of the methods. The computer system used is a 32 node system with Pentium4 3.0 GHz, FSB 800 MHz, 2 GB Memory, and Gigabit Ethernet.

The result is shown in Table 1. CGCG is 3.6 times faster than BDD for the two million DOFs model. It should be noted that the memory usage of CGCG is only about 30% that of BDD. As to the six million DOFs model, BDD was not able to run on 16 nodes because of memory overflow. It ran on 32 nodes, but its memory usage was as large as 1.3 GB per node. CGCG is 6.1 times faster than BDD and its memory usage is only 18% of that of BDD. These advantages of CGCG over BDD will increase for larger models.

4.2 Parallel Performance of ADVC on Blue Gene

ADVC has been tuned and optimized on BG/L. In order to prioritize the functions to be tuned, the code has been profiled using small experimental data sets. The target functions have been matrix-vector products, or interpolation and prolongation processes between the global domain and the coarse grid. Such examples have been selected, analyzed, and tuned, based on the percentage of time spent in the functions of interest, with the expectation that the functions would remain high-priority across data sets with respect to the potential efficacy of optimizations.

We have tested two models on BG/L. One is a small model of 12 million DOFs (still comparatively larger than what is handled in existing practical circumstances), consisting of a single volume of a machine component (an engine block) with no multi-point constraint connections. The other is a medium size model of 90 million DOFs (too large to handle in existing practical CAE circumstances), obtained by giving a finer mesh to the small model. We used one rack (1024 BG/L nodes) for the small model and two racks for the medium-sized model.

ADVC allows the option of using the CG method without efficient preconditioning. Since the CG method is the simplest, and has little overhead in parallelization in the ADVC architecture, we can expect high parallel efficiency. For the purposes of estimating the parallel efficiency of the CGCG method, we compare the CG and CGCG methods using the small model. The parallel performances are shown in Table 2. The CGCG method scales up to 1024 nodes even for the small model. The parallel efficiency in this table is given on the basis of the CPU time of the 64 node result. The parallel efficiency of the CG method is as high as 62% even on 1024 nodes, but the CGCG method is 13 times faster than the CG method on the same number of nodes. The parallel efficiency of the CGCG method on 1024 nodes is 40.4%, which is not so poor. These results show the absolute parallel performance of the CGCG method. The memory usage on 1024 nodes is so small that calculation on a larger number of nodes will not be efficient.

The medium size model with 90 million DOFs can be solved in less than three minutes, as shown in Table 3.

Table 1 Performance of BDD and CGCG

| Model | No. of nodes | BDD CPU time (s) | BDD Memory/Node (MB) | CGCG CPU time (s) | CGCG Memory/Node (MB) |
|---|---|---|---|---|---|
| 2 million | 16 | 176 | 704 | 49 | 202 |
| 6 million | 16 | - | Overflow | 106 | 482 |
| 6 million | 32 | 448 | 1290 | 73 | 238 |

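Table 2 Performance for small model

| Number of nodes | 64 | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|---|
| CPU time (s), CG | 3678 | 1917 | 1023 | 575 | 371 |
| CPU time (s), CGCG | 181 | 93 | 66 | 41 | 28 |
| Parallel efficiency (%), CG | 100 | 95.9 | 89.9 | 80 | 62 |
| Parallel efficiency (%), CGCG | 100 | 97.3 | 68.6 | 55.2 | 40.4 |
| Memory usage (MB), CG | 113 | 62 | 32 | 16 | 8 |
| Memory usage (MB), CGCG | 370 | 134 | 55 | 34 | 23 |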
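The parallel efficiencies in Table 2 follow from the CPU times when the 64 node run is taken as the baseline, as stated in the text. The short Python sketch below reproduces them; the function name `parallel_efficiency` is a hypothetical helper, and the numbers are copied from Table 2.

```python
nodes = [64, 128, 256, 512, 1024]
cpu_time = {
    "CG": [3678, 1917, 1023, 575, 371],
    "CGCG": [181, 93, 66, 41, 28],
}

def parallel_efficiency(times, nodes):
    """Efficiency in percent relative to the first (baseline) entry."""
    t0, n0 = times[0], nodes[0]
    return [100.0 * (t0 * n0) / (t * n) for t, n in zip(times, nodes)]

eff_cg = [round(e, 1) for e in parallel_efficiency(cpu_time["CG"], nodes)]
eff_cgcg = [round(e, 1) for e in parallel_efficiency(cpu_time["CGCG"], nodes)]

# Ratio of CG to CGCG CPU time on 1024 nodes.
speedup_1024 = cpu_time["CG"][-1] / cpu_time["CGCG"][-1]
```

The ratio on 1024 nodes is 371 / 28, or about 13, in line with the factor quoted in the text.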

Table 3 Performance of medium size model

| Number of nodes | CPU time (s) | Memory usage / Node (MB) |
|---|---|---|
| 2048 | 179 | 67 |

5 Drop Impact Analysis of Mobile Phone Using ADVC on Blue Gene/L

5.1 Introduction

In the design or the reliability evaluation of electronics products, full assembly analysis is one of the targets of future technology. The difficulty lies in the generation of the mesh and FEA models from the CAD model, in addition to the analysis scale, which can be large depending on the level of detail we assume and the number of components. The number of components of electronics products ranges from several hundred to several tens of thousands. Moreover, this depends greatly on the product under consideration. In addition, the analysis of a mobile phone or notebook PC has a cycle time that is considerably shorter than, for example, that of passenger cars.

Simulation tests using prototypes of real product models are performed intensively in the electronics industry. However, some points could be and should be improved. In general, such tests cost a great deal and require a long period of time to perform. In addition, we cannot suppose that all environmental and user circumstances can be imposed on the equipment, and it is difficult to know quantities related to computational mechanics, such as stress or strain distributions in structural mechanics. The current process is good for a posteriori inference, but it is not suitable for making phenomenological estimations, either qualitatively or quantitatively. Our ideal situation is to use computational mechanics for a large part of design and reliability analysis.

The present situation in our field, in general, restricts us to two dimensional analysis assuming symmetry, to zooming, or to simplified analysis that confines the analysis area to as local or partial an area as possible. For those approximations in analysis, knowledge of mechanics is indispensable and the points we cannot infer must be assumed to exist. Furthermore, even under significant approximation of the model, we need a longer time than is usually expected in order to perform the analysis. Our ideal situation will be achieved only when we can analyze fully assembled product models, for whatever analysis we utilize, within a realistic and allowable execution time.

Toshiba Corporation, in cooperation with Allied Engineering Corporation, has been conducting studies of strength analysis of electronics products, including the analysis of mobile phones using ADVC, which has delivered performance close to what we need as well as potential extensibility. Through these analyses, a future vision of design and reliability analysis can be acquired. In the following sections, we describe the analysis of a mobile phone which we have conducted.

5.2 Mobile Phone Model

In general, there are cases where mobile phone users use their phones under conditions beyond the designer’s assumptions or imagination. This brings particular difficulties into mobile phone design. Mobile phone manufacturers are gathering broken or used products and investigating the usage conditions that produced unit failure.

The fracture of LCDs (liquid crystal displays) is one of the possible problems in the field of mobile phone design. Although the mechanism of fracture of the LCD is not necessarily clear, several scenarios can be supposed for a flip phone, for example: (1) a static load is applied to the surface of the protective panel on the LCD, as in a case where the user has the phone in his/her back trouser pocket and sits; (2) the phone pinches something between its two halves and a large load is applied; or (3) a dynamic load is imposed on the upper or lower edge on the front side in its unfolded state, where the phone slips out of the user's hand from a position around his/her ear in the standing posture.

It is desirable to investigate these various scenarios in designing a mobile phone. In this paper, we describe the drop impact analysis of a mobile phone. In many cases, impact analysis is performed by an explicit method, but we use an implicit analysis by ADVC (See Section 2.1). If we used an explicit method, we would encounter difficulties in taking time steps, which depend on the mesh size. Moreover, the scale of the model becomes larger as the mesh size grows smaller.

The mobile phone model analyzed is shown in Fig.2 to Fig.4. Although the inner structure is nearly fully considered, some simplifications have been made. We do not assume a condition of contact between the inner devices and structures (See Section 0), and we are focusing on the dynamic response of the outer body on the inner structures. Only one more step is needed to reach an analysis that takes the contact condition fully into consideration. The model has been created by a CAD system, and imported by the ADVC preprocessor.

Fig.2 Mobile phone CAD model with half-transparentized outer cases (1)

Fig.3 Mobile phone CAD model with half-transparentized outer cases (2)


Fig.4 Cross-section view of the mobile phone model

5.3 FEA Model and Analysis Conditions

The ADVC preprocessor imports the CAD model, and generates its mesh consisting of quadratic tetrahedral elements. The obtained mesh is shown in Fig.5 to Fig.9. The number of nodes, elements and DOFs are 47,453,366, 32,523,777 and 142,360,098, respectively. The maximum aspect ratio of the mesh is 168.7, the maximum and minimum degrees of contained angle are 168.7 and 2.3, respectively.
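As a quick consistency check on these counts, assuming three displacement DOFs per node (the usual case for solid tetrahedral elements, though not stated explicitly here), the number of DOFs follows from the number of nodes:

$$47{,}453{,}366 \times 3 = 142{,}360{,}098$$

which agrees with the number of DOFs given above.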

To assemble all of the components, the mesh is generated in a form such that the bonded surfaces of two neighboring bodies share their nodal points (See Section 2.1), as can be seen in Fig.9.

The functions of the dynamic, contact, and elastic analysis of ADVC are considered (See Section 2.1). The contact analysis is used to simulate impact load applied on the outer case from the ground. The surface of the outer case is set to be the slave surface whereas the ground is set to be the master surface. The ground is assumed to be rigid. The state of the drop can be taken in any way, that is, folded or unfolded, and at any angle, height or initial velocity, etc.

The contact condition, boundary condition and other analysis conditions have been set by the ADVC preprocessor (See Section 3.9), as well as the mesh generation.

Fig.5 Mesh for outer cover: (a) Upper case (rear side); (b) Lower case (front side)

Fig.6 LCD panel: (a) Main LCD panel; (b) Sub LCD panel

Fig.7 Rear side of upper case and sub LCD panel

Fig.8 Lower PCB (Printed Circuit Board): (a) Lower PCB and devices (front side); (b) Lower PCB and devices (rear side)

Fig.9 Small devices on rear side of lower PCB

5.4 FEA Results

We have performed one hundred steps from immediately after the crash, which is about 2.4 ms, on four racks of BG/L, 4096 nodes. The size of the input and output data are about 3.5 GB and 83.5 GB, respectively.

The analysis results obtained are shown in Fig. 10, Fig. 11 and Fig. 12. In order to obtain these figures, the ADVC postprocessor has been used (See Section 3.9). Despite a size as large as 47 million nodal points, we have been able to visualize rather smoothly.

Fig.10 shows the time variation of the deformation of the entire model viewed from the left-hand side, where the color contour is the equivalent stress distribution. The impact force generated on the end edge of the lower case is stronger than that on the upper case, since the lower body includes the internal battery and is heavier than the upper body.

At the first step, 0.05 ms, the lower end edge of the lower case crashes on the ground, and in 0.1 ms after the crash, the upper end edge of the upper case crashes (Fig.10 (2), 0.15 ms). The condition that the crash of the upper end edge follows the crash of the lower end edge depends on the initial setting of the position of the whole body.

The impulse wave of the equivalent stress is propagated through the lower case, stagnated for a while, being blocked by the center hinge (Fig. 10 (4) - (7), 0.250 - 0.697 ms), and then propagated into the upper case over the hinge (Fig. 10 (9) - (13), 0.945 - 1.74 ms). The equivalent stress reaches a large value when the wave from the lower case propagates through the hinge. The edge of the upper case bounces upward at Fig. 10 (5), 0.449 ms, and then crashes on the ground again at Fig.10 (14), 1.84 ms. The impact of the second crash of the upper case is larger (Fig.10 (15) and (16), 1.89 – 1.99 ms) than the first one (Fig.10 (4)). This stress wave collides with the wave from the lower case (Fig. 10 (17) and (18), 2.04 - 2.19 ms). The center hinge crashes on the ground at Fig.10 (16), 1.99 ms, and it generates stress on the hinge (Fig. 10 (17), 2.04 ms). The peak of the equivalent stress throughout the analysis appears around the final step, 2.44 ms.

The deformation follows a little behind the stress distribution. The whole body bends in the unfolded direction throughout the simulated time, but the deformation of the lower and upper bodies is larger, and even upward bending can be seen (Fig.10 (9), 1.19 ms).

The result described above that the center hinge crashes on the ground and the lower and upper bodies bend largely along with the time scale coincides with our knowledge.

Fig. 11 shows the time variation of the equivalent stress distribution appearing on several devices on the lower PCB (printed circuit board). In contrast to Fig. 10, the deformation is not superposed, although the devices are in reality moving up and down considerably. Fig. 11 (2), 0.200 ms corresponds to Fig. 10 (3), where the stress has appeared on the lower edge but exists only locally, and the bending is not yet pronounced. The stresses generated on the board and devices are small. Fig. 11 (7), 2.19 ms corresponds to the state where the whole body bends most. The stress on the two devices in the center of this figure is the largest throughout the analysis.

Fig. 12 is an example of the equivalent stress distribution at 0.945 ms with the deformation, amplified 30 times, superposed. In these figures, we can see the stress wave propagating on the board depending on the bending of the board. Stress concentrations can be seen at the bases of the devices. At this time step, Fig. 12 also shows the upper side of the board deforming in the direction of shrinkage toward the center.

The contours in Fig. 12 are very clear. In a standard analysis with a coarser mesh, we frequently see spots scattered over the stress distribution, caused by the lower accuracy of the FEA discretization. The clear contours here are due to the fine mesh we generated.

From the viewpoint of mobile phone design, it should be emphasized that we have obtained a way to evaluate the stress and deformation not only on the body of the phone but also on the inner structures and devices. Although we can observe the behavior of the outer case in prototype tests, our present product design process gives us no way to estimate the behavior of the inner structures. As for the results themselves, however, we need to continue to evaluate and validate them through prototype tests and further calculations.

Fig. 10 Results of variation of deformation of the mobile phone body. Panels: (1) Step 2, 0.0502 ms; (2) Step 8, 0.150 ms; (3) Step 10, 0.200 ms; (4) Step 12, 0.250 ms; (5) Step 20, 0.449 ms; (6) Step 26, 0.600 ms; (7) Step 30, 0.697 ms; (8) Step 34, 0.796 ms; (9) Step 40, 0.945 ms; (10) Step 50, 1.19 ms; (11) Step 60, 1.44 ms; (12) Step 64, 1.54 ms; (13) Step 72, 1.74 ms; (14) Step 76, 1.84 ms; (15) Step 78, 1.89 ms; (16) Step 82, 1.99 ms; (17) Step 84, 2.04 ms; (18) Step 90, 2.19 ms; (19) Step 100, 2.44 ms

Fig. 11 Results of time variation of the equivalent stress distributions on the devices on the lower PCB. Panels: (1) Step 2, 0.050 ms; (2) Step 10, 0.200 ms; (3) Step 20, 0.449 ms; (4) Step 40, 0.945 ms; (5) Step 60, 1.440 ms; (6) Step 80, 2.10 ms; (7) Step 90, 2.19 ms; (8) Step 100, 2.44 ms

Fig. 12 Result of equivalent stress distribution on the devices and the lower PCB, at step 40, 0.945 ms (deformation amplified 30 times superposed). Panels: (1) Without mesh; (2) With mesh

5.5 Performances of ADVC

As to the calculation time, it took 850 seconds for two steps and about 12.1 hours for all one hundred steps. The model size, 142 million DOFs at most, does not seem exceptionally large, but a drop impact analysis with the contact condition on the entire surface of the outer case could not be handled without a high performance code like ADVC and a parallel system like BG/L. On the whole, our system showed very good performance: it took only half a day for this analysis. The floating point performance obtained was 538 GFLOPS on 4096 nodes of BG/L. This corresponds to 134.5 MFLOPS per node.

As has been stated frequently, it is not an easy task to parallelize implicit structural FEA codes. In these circumstances, Salinas ([8]) showed remarkable results of 292.5 GFLOPS on 2940 processors of ASCI Red ([20]), which corresponds to 99.5 MFLOPS per processor, and also showed 1.16 TFLOPS on 3375 processors of ASCI White ([8]), which corresponds to 343.7 MFLOPS per processor. The authors, M. Bhardwaj et al., won the Gordon Bell Prize at SC2002.

Our obtained performance is 1.8 times that of Salinas on 2940 processors of ASCI Red and 0.46 times that of Salinas on 3375 processors of ASCI White. It should be noted that the Salinas performance figures stated above were obtained by a static analysis of a simple cube problem (the cube is used in order to evaluate parallel performance). In contrast, our result was obtained by an implicit dynamic impact analysis of a complicated real product model. What matters in the CAE field is the real performance of a code on a realistic product model, and this is where the advantage of ADVC lies.
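The per-processor figures and the ratios quoted in this section can be cross-checked with simple arithmetic. The following Python sketch is an illustration only: the helper names and the `mflops_per_gflops` parameter are hypothetical and belong to neither ADVC nor Salinas, and the inputs are just the aggregate rates and processor counts stated above. Under these assumptions, the Salinas per-processor values follow from 1 GFLOPS = 1000 MFLOPS, whereas the 134.5 MFLOPS per node quoted for ADVC is reproduced with 1 GFLOPS = 1024 MFLOPS (the decimal convention gives about 131.3). The factors 1.8 and 0.46 appear to be ratios of aggregate performance, not of per-processor performance.

```python
# Illustrative cross-check of the performance figures quoted in Section 5.5.
# Aggregate GFLOPS and node/processor counts are taken from the text;
# the helper names below are hypothetical and not part of ADVC or Salinas.

RUNS = {
    "ADVC / BG/L": (538.0, 4096),
    "Salinas / ASCI Red": (292.5, 2940),
    "Salinas / ASCI White": (1160.0, 3375),  # 1.16 TFLOPS expressed in GFLOPS
}


def per_unit_mflops(gflops, units, mflops_per_gflops=1000.0):
    """Average MFLOPS per node or processor for a given aggregate rate."""
    return gflops * mflops_per_gflops / units


def aggregate_ratio(run_a, run_b):
    """Ratio of the aggregate GFLOPS of run_a to that of run_b."""
    return RUNS[run_a][0] / RUNS[run_b][0]


if __name__ == "__main__":
    for name, (gflops, units) in RUNS.items():
        decimal = per_unit_mflops(gflops, units)
        binary = per_unit_mflops(gflops, units, mflops_per_gflops=1024.0)
        print(f"{name}: {decimal:.1f} MFLOPS/unit (x1000), {binary:.1f} (x1024)")
    red = aggregate_ratio("ADVC / BG/L", "Salinas / ASCI Red")
    white = aggregate_ratio("ADVC / BG/L", "Salinas / ASCI White")
    print(f"ADVC relative to Salinas on ASCI Red:   {red:.2f}")
    print(f"ADVC relative to Salinas on ASCI White: {white:.2f}")
```

Since the machines, node counts, and problems differ, these ratios only restate the reported numbers; they are not a like-for-like benchmark.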

M. F. Adams et al. ([9]) were also Gordon Bell Prize winners, at SC2004. They showed 470 GFLOPS on 4088 processors of ASCI White. They solved a nonlinear stiffness equation discretized in FEA form by the AMG (Algebraic Multi-Grid) method. Their field was biomechanics, which is a little different from industrial CAE structural analysis and from our standpoint.

6 Conclusions

Large scale analysis is emerging as a major and realistic aspect of the design analysis of industrial products. In order to conduct large scale analyses, the CGCG algorithm for parallel structural analysis has been developed and implemented in the commercial FEA system ADVC. We have attempted a challenging analysis using ADVC on a high performance computer system, Blue Gene/L (BG/L), which should show the way toward future product design analysis and computational mechanics.

The CGCG algorithm has been described in contrast to the domain decomposition method, especially to the BDD method. CGCG has a simpler algorithm than the standard form of the DD method and, accordingly, gives lighter calculation load. The algorithm consists in the global CG iteration preconditioned with the coarse grid problem (See Section 3).

The performance of CGCG has been compared with BDD (See Section 4.1). ADVC is tuned and optimized on BG/L at IBM T. J. Watson Research Center (WRC). ADVC and BG/L showed remarkable parallel performance (See Section 4.2).

We have conducted a drop impact analysis for a product model of a mobile phone (See Section 0). The model we analyzed has 47 million nodal points and 142 million DOFs. The drop analysis with the contact condition on the entire surface of the outer case could not be handled with the existing commercial CAE systems. Our analysis is an unprecedented attempt in the electronics industry. ADVC on BG/L has shown very good performance. It took only half a day, 12.1 hours, on four racks (4096 nodes) of BG/L for the analysis of about 2.4 ms, from the first crash of the phone in unfolded form to the crash of the center hinge on the ground. The floating point performance obtained was 538 GFLOPS. This performance has been compared with that of Salinas, the SC2002 Gordon Bell winner (See Section 0).

The result that the center hinge crashes on the ground on this time scale coincides with our knowledge. From the viewpoint of mobile phone design, it is important to have obtained a way to investigate the propagation of the equivalent stress distribution and the deformation quantitatively, using a real CAD-based model of both the outer case and the inner structures, although we need to continue to evaluate and validate the analysis results through prototype tests and further calculations (See Section 2.1).

The mesh and FEA input data have been generated by ADVC preprocessor. The analysis results have been visualized by ADVC postprocessor (See Section 3.9).

In conducting their work, the authors have shared and cooperated with one another and others. Allied Engineering Corporation (AE), University of Tokyo and Keio University are the developers of our code ADVC. AE people have produced the mesh model of the mobile phone and constructed the FEA model. WRC and IBM India Research Laboratory have taken charge of the tuning and optimization of ADVC. WRC has provided the BG/L system. Toshiba Corporation has provided the mobile phone CAD model and their expertise in product design. NIWS Co., Ltd has supported our work and provided their BG/L.

Acknowledgements

Finally, we express special thanks to Mr. Craig Stunkel and Mr. Fred Mintzer at WRC. Mr. Stunkel has been organizing our project and Mr. Mintzer has arranged the BG/L system. We would also like to express our appreciation to Mr. Nobuyuki Koizumi and Dr. Jung-kook Hong at IBM Japan for their continuous and careful attention to our project.

References

[1] H. Akiba, M. Suzuki and T. Ohyama, Introduction to domain decomposition method, Journal of Simulation Technology, four serial articles: Vol.22, No.2, 111-117, 2003, Vol.22, No.3, 174-181, 2002, Vol.22, No.4, 261-279, 2002, Vol.23, No.1, 48-55, 2003 (in Japanese)

[2] J. Mandel, Balancing domain decomposition, Communications on Numerical Methods in Engineering, 9, 233-241, 1993

[3] J. Mandel, M. Brezina, Balancing domain decomposition: Theory and performance in two and three dimensions, 1992 http://casper.cs.yale.edu/mgnet/www/mgnet-papers.html

[4] C. Farhat, F.-X. Roux: Implicit parallel processing in structural mechanics, Computational Mechanics Advances, 2, 1-124, 1994

[5] C. Farhat and F. X. Roux, A method of finite element tearing and interconnecting and its parallel solution algorithm, International Journal for Numerical Methods in Engineering. 32, 1205-1227, 1991

[6] C. Farhat, J. Mandel, F. X. Roux, Optimal convergence properties of the FETI domain decomposition method. Computer Methods in Applied Mechanics and Engineering, 115, 367-388, 1994

[7] C. Farhat, M. Lesoinne, P. LeTallec, K. Pierson and D. Rixen, FETI-DP: a dual-primal unified FETI method - part I: A faster alternative to the two-level FETI method, Int. J. Numer. Meth. Engng. 50, 1523-1544, 2001

[8] M. Bhardwaj, K. Pierson, G. Reese, T. Walsh, D. Day, K. Alvin, J. Peery, C. Farhat, and M. Lesoinne, Salinas: A scalable software for high-performance structural and solid mechanics simulations, Technical Papers of SC2002, 2002

[9] M. F. Adams, H. H. Bayraktar, T. M. Keaveny, P. Papadopoulos, Ultrascalable Implicit Finite Element Analyses in Solid Mechanics with over a Half a Billion Degrees of Freedom, Technical Papers of SC2004, 2004

[10] M. Suzuki, T. Ohyama, H. Akiba, S. Yoshimura and H. Noguchi, Development of fast and robust parallel CGCG solver for large scale finite element analyses, Transactions of the Japan Society of Mechanical Engineers, Series A, Vol. 68, 1010-1017, 2002 (in Japanese)

[11] ADVENTURE project website, http://adventure.q.t.u-tokyo.ac.jp/

[12] S. Yoshimura, R. Shioya, H. Noguchi and T. Miyamura, Advanced general-purpose computational mechanics system for large scale analysis and design, Journal of Computational and Applied Mathematics, Vol.149, 279-296, 2002

[13] N. R. Adiga et al, An overview of the BlueGene/L supercomputer. In SC2002 – High performance networking and computing, Baltimore, MD, 2002

[14] G. Almasi, R. Bellofatto, J. Brunheroto, C. Cascaval, J. Castaños, L. Ceze, P. Crumley, C. Erway, J. Gagliano, D. Lieber, X. Martorell, J. Moreira, A. Sanomiya, and K. Strauss, An Overview of the BlueGene/L system software organization (Distinguished Paper). Proceedings of the 2003 International Conference on Parallel and Distributed Computing (Euro-Par 2003), 543-555, 2003

[15] L. Bachega, S. Chatterjee, K. Dockser, J. Gunnels, M. Gupta, F. Gustavson, C. Lapkowski, G. Liu, M. Mendell, C. Wait and T.J.C. Ward, A high-performance SIMD floating point unit design for BlueGene/L: Architecture, Compilation, and Algorithm Design. Parallel Architecture and Compilation Techniques (PACT 2004), 2004

[16] B. F. Smith, P. E. Bjørstad and W. D. Gropp: Domain decomposition: parallel multilevel methods for elliptic partial differential equations, Cambridge University Press, 1996

[17] T. Miyamura, Incorporation of multipoint constraints into the balancing domain decomposition method and its parallel implementation, to be published in International Journal for Numerical Methods in Engineering, DOI: 10.1002/nme.1766

[18] Metis Website, http://glaros.dtc.umn.edu/gkhome/views/metis/

[19] M. Ogino, R. Shioya, H. Kawai and S. Yoshimura, Seismic response analysis of nuclear pressure vessel model with ADVENTURE System on the Earth Simulator, Journal of The Earth Simulator, 2, 41-54, 2005

[20] ASCI Red website, http://www.sandia.gov/ASCI/Red/

[21] ASCI White website, http://www.llnl.gov/asci/platforms/white/
