
J Supercomput (2007) 39: 177–203
DOI 10.1007/s11227-007-0110-z

Enabling scalable parallel implementations of structured adaptive mesh refinement applications

Sumir Chandra · Xiaolin Li · Taher Saif · Manish Parashar

Published online: 28 February 2007
© Springer Science+Business Media, LLC 2007

Abstract Parallel implementations of dynamic structured adaptive mesh refinement (SAMR) methods lead to significant runtime management challenges that can limit their scalability on large systems. This paper presents a runtime engine that addresses the scalability of SAMR applications with localized refinements and high SAMR efficiencies on large numbers of processors (up to 1024 processors). The SAMR runtime engine augments hierarchical partitioning with bin-packing based load-balancing to manage the space-time heterogeneity of the SAMR grid hierarchy, and includes a communication substrate that optimizes the use of MPI non-blocking communication primitives. An experimental evaluation on the IBM SP2 supercomputer using the 3-D Richtmyer-Meshkov compressible turbulence kernel demonstrates the effectiveness of the runtime engine in improving SAMR scalability.

Keywords Structured adaptive mesh refinement · SAMR scalability · Hierarchical partitioning · Bin-packing based load-balancing · MPI non-blocking communication optimization · 3-D Richtmyer-Meshkov application

S. Chandra (✉) · X. Li · T. Saif · M. Parashar
The Applied Software Systems Laboratory, Department of Electrical and Computer Engineering, Rutgers, The State University of New Jersey, 94 Brett Road, Piscataway, NJ 08854, USA
e-mail: [email protected]

X. Li
Scalable Software Systems Laboratory, Department of Computer Science, Oklahoma State University, 219 MSCS, Stillwater, OK 74078, USA
e-mail: [email protected]

T. Saif
e-mail: [email protected]

M. Parashar
e-mail: [email protected]


1 Introduction

Dynamic adaptive mesh refinement (AMR) [3, 5, 29] methods for the numerical solution to partial differential equations (PDEs) employ locally optimal approximations, and can yield highly advantageous ratios for cost/accuracy when compared to methods based upon static uniform approximations. These techniques seek to improve the accuracy of the solution by dynamically refining the computational grid in regions with large local solution error. Structured adaptive mesh refinement (SAMR) methods are based on uniform patch-based refinements overlaid on a structured coarse grid, and provide an alternative to the general, unstructured AMR approach. These methods are being effectively used for adaptive PDE solutions in many domains, including computational fluid dynamics [2, 4, 24], numerical relativity [9, 10], astrophysics [1, 6, 18], and subsurface modeling and oil reservoir simulation [22, 30]. Methods based on SAMR can lead to computationally efficient implementations as they require uniform operations on regular arrays and exhibit structured communication patterns. Furthermore, these methods tend to be easier to implement and manage due to their regular structure.

Parallel implementations of SAMR methods offer the potential for accurate solutions of physically realistic models of complex physical phenomena. These implementations also lead to interesting challenges in dynamic resource allocation, data-distribution, load-balancing, and runtime management [7]. Critical among these is the partitioning of the adaptive grid hierarchy to balance load, optimize communication and synchronization, minimize data migration costs, and maximize grid quality (e.g., aspect ratio) and available parallelism. While balancing these conflicting requirements is necessary for achieving high scalability of SAMR applications, identifying appropriate trade-offs is non-trivial [28]. This is especially true for SAMR applications with highly dynamic and localized features that require high SAMR efficiencies¹ [5], as they lead to deep hierarchies and small regions of refinement with unfavorable computation to communication ratios.

The primary objective of the research presented in this paper is to address the scalability requirements of such applications on large numbers of processors (up to 1024 processors). In this paper, we present a SAMR runtime engine consisting of two components: (1) a framework that augments hierarchical partitioning with bin-packing based load-balancing to address the space-time heterogeneity and dynamism of the SAMR grid hierarchy, and (2) a SAMR communication substrate that is sensitive to the implementations of MPI non-blocking communication primitives and is optimized to reduce communication overheads. The SAMR runtime engine, as well as its individual components, is experimentally evaluated on the IBM SP2 supercomputer using a 3-D Richtmyer-Meshkov (RM3D) compressible turbulence kernel. Results demonstrate that the proposed techniques individually improve runtime performance, and collectively enable the scalable execution of applications with localized refinements and high SAMR efficiencies, which was previously not feasible.

The rest of the paper is organized as follows. Section 2 presents an overview of structured adaptive mesh refinement and related SAMR research efforts.

¹ AMR efficiency is the measure of effectiveness of AMR and is computed as one minus the ratio of the number of grid points using AMR to that required if a uniform mesh is used.
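In symbols (our notation, not the paper's), with N_AMR grid points in the adaptive hierarchy and N_uniform points in the equivalent uniform fine mesh:

    \text{AMR efficiency} = 1 - \frac{N_{\mathrm{AMR}}}{N_{\mathrm{uniform}}}

For example, the 87.50% efficiency reported on 32 processors in Table 1 means the adaptive hierarchy touches only 12.5% of the grid points that the equivalent uniform mesh would require.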


Section 3 analyzes the computation and communication behavior of parallel SAMR. Section 4 details the SAMR hierarchical partitioning framework, which includes the greedy partitioner and the level-based and bin-packing based partitioning/load-balancing techniques. Section 5 discusses the behavior of MPI non-blocking communication primitives on the IBM SP2 and outlines the optimization strategy for the SAMR communication substrate. Section 6 describes the experimental evaluation of the SAMR runtime engine and the scalability analysis using the RM3D application. Section 7 presents concluding remarks.

2 Structured adaptive mesh refinement

2.1 Overview of SAMR

The numerical solution to a PDE is obtained by discretizing the problem domain and computing an approximate solution to the PDE at the discrete points. One approach to discretizing the domain is to introduce a structured uniform Cartesian grid. The unknowns of the PDE are then approximated numerically at each discrete grid point. The resolution of the grid (or grid spacing) determines the local and global error of this approximation and is typically dictated by the solution features that need to be resolved. The resolution also determines computational costs and storage requirements.

For problems with “well behaved” PDE solutions, it is typically possible to find a grid with uniform resolution that provides the required solution accuracy using acceptable computational and storage resources. However, for problems where the solution contains shocks, discontinuities, or steep gradients, finding such a uniform grid that meets accuracy and resource requirements is not always possible. Furthermore, in these classes of problems, the solution features that require high resolution are localized, causing the high resolution and associated computational effort in other regions of the uniform grid to be wasted. Finally, due to solution dynamics in time-dependent problems, it is difficult to estimate the minimum grid resolution required to obtain acceptable solution accuracy.

In the case of SAMR methods, dynamic adaptation is achieved by tracking regions in the domain that require higher resolution and dynamically overlaying finer grids on these regions. These techniques start with a base coarse grid with minimum acceptable resolution that covers the entire computational domain. As the solution progresses, regions in the domain with high solution error, requiring additional resolution, are identified and refined. Refinement proceeds recursively so that the refined regions requiring more resolution are similarly tagged and even finer grids are overlaid on these regions. The resulting grid structure is a dynamic adaptive grid hierarchy. The adaptive grid hierarchy corresponding to the SAMR formulation by Berger and Oliger [5] is illustrated in Fig. 1.

2.2 Related work on SAMR

Currently, a wide spectrum of software infrastructures and partitioning libraries has been developed and deployed to support parallel and distributed implementations of SAMR applications. Each system represents a unique combination


Fig. 1 Berger-Oliger SAMR formulation of adaptive grid hierarchies

of design decisions in terms of algorithms, data structures, decomposition schemes, mapping and distribution strategies, and communication mechanisms. A selection of related SAMR research efforts is summarized in this section.

PARAMESH [16] is a FORTRAN library designed to facilitate the parallelization of an existing serial code that uses structured grids. The package builds a hierarchy of sub-grids to cover the computational domain, with spatial resolution varying to satisfy the demands of the application. These sub-grid blocks form the nodes of a tree data-structure (a quad-tree in 2-D or an oct-tree in 3-D). Each grid block has a logically Cartesian structured mesh.

The Structured Adaptive Mesh Refinement Application Infrastructure (SAMRAI) [12, 31] provides computational scientists with general and extensible software support for the prototyping and development of parallel SAMR applications. The framework contains modules for handling tasks such as visualization, mesh management, integration algorithms, geometry, etc. SAMRAI makes extensive use of C++ object-oriented techniques and various design patterns such as abstract factory, strategy, and chain of responsibility. Load-balancing in SAMRAI occurs independently at each refinement level, resulting in convenient handling but increased parent-child communication.

The Chombo [8] package provides a distributed infrastructure and set of tools for implementing parallel calculations using finite difference methods for the solution of partial differential equations on block-structured, adaptively refined rectangular grids. Chombo includes both elliptic and time-dependent modules as well as support for parallel platforms and standardized self-describing file formats.

Grid Adaptive Computational Engine (GrACE) [20] is an adaptive computational and data-management framework for enabling distributed adaptive mesh refinement computations on structured grids. It is built on a virtual, semantically specialized distributed shared memory substrate with multifaceted objects specialized to distributed


adaptive grid hierarchies and grid functions. The development of GrACE is based on systems engineering, focusing on abstract types, a layered approach, and separation of concerns. GrACE allows the user to build parallel SAMR applications and provides support for multigrid methods.

One approach to dynamic load-balancing (DLB) [13] for SAMR applications combines a grid-splitting technique with direct grid movements so as to efficiently redistribute workload among processors and reduce parallel execution time. The DLB scheme is composed of two phases: a moving-grid phase and a splitting-grid phase. The moving-grid phase is invoked after each adaptation and utilizes the global information to send grids directly from overloaded processors to underloaded processors. If imbalance still exists, the splitting-grid phase is invoked and splits a grid into two smaller grids along the longest dimension. This sequence continues until the load is balanced within a given tolerance. A modified DLB scheme on distributed systems [14] considers the heterogeneity of processors and the heterogeneity and dynamic load of the network. The scheme employs global load-balancing and local load-balancing, and uses a heuristic method to evaluate the computational gain and redistribution cost for global redistribution.

The stack of partitioners within the SAMR runtime engine presented in this paper builds on a locality-preserving, space-filling curve [11, 23, 26] based representation of the SAMR grid hierarchy [21] and enhances it based on localized requirements, to minimize synchronization costs within a level (level-based partitioning), balance load (bin-packing based load-balancing), or reduce partitioning costs (greedy partitioner). Moreover, the optimization of the underlying SAMR communication substrate alleviates synchronization costs. Coupled together, these features of the SAMR runtime engine address the space-time heterogeneity and dynamic load requirements of SAMR applications, and enable their scalable implementation on large numbers of processors.

3 Computation and communication behavior for parallel SAMR

In the targeted SAMR formulation, the grid hierarchy is refined both in space and in time. Refinements in space create finer level grids which have more grid points/cells than their parents. Refinements in time mean that finer grids take smaller time steps and hence have to be advanced more often. As a result, finer grids not only have greater computational loads, but also have to be integrated and synchronized more often. This results in a space-time heterogeneity in the SAMR adaptive grid hierarchy. Furthermore, regridding occurs at regular intervals at each level of the SAMR hierarchy and results in refined regions being created, moved, and deleted. Figure 2 illustrates a 2-D snapshot of a sample SAMR application representing a combustion simulation [25] that investigates the ignition of an H2-Air mixture. The top half of Fig. 2 depicts the temperature profile of the non-uniform temperature field with 3 hot-spots, while the bottom half of the figure shows the mass-fraction plots of various radicals produced during the simulation. The application exhibits high dynamism and space-time heterogeneity, and presents significant runtime and scalability challenges.


Fig. 2 2-D snapshots of a combustion simulation illustrating the ignition of an H2-Air mixture in a non-uniform temperature field with three hot spots (Courtesy: J. Ray, et al., Sandia National Laboratory). The top half of the figure shows the temperature profile while the bottom half shows the mass-fraction plots of radicals

3.1 Parallel SAMR implementations

Parallel implementations of SAMR applications typically partition the dynamic grid hierarchy across available processors, and the processors operate on their local portions of the computational domain in parallel. Each processor starts at the coarsest level, integrates the patches at this level, and performs intra-level or “ghost” communications to update the boundaries of the patches. It then recursively operates on the finer grids using the refined time steps, i.e., for each step on a parent grid, there are multiple steps (equal to the time refinement factor) on a child grid. When the parent and child grids are at the same physical time, inter-level communications are used to inject information from the child to its parent. The solution error at different levels of the SAMR grid hierarchy is evaluated at regular intervals and this error is used to


determine the regions where the hierarchy needs to be locally refined or coarsened.Dynamic re-partitioning and redistribution is typically required after this step.
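The recursive integration described above can be summarized by the following sketch (our rendering in C; all function names and parameters are illustrative, not the GrACE API):

    #include <stdio.h>

    /* Illustrative stubs -- in a real SAMR code these operate on patch data. */
    static void integrate_patches(int level)        { printf("integrate L%d\n", level); }
    static void exchange_ghosts(int level)          { printf("ghost sync L%d\n", level); }
    static void restrict_to_parent(int fine, int c) { printf("restrict L%d->L%d\n", fine, c); }

    /* Berger-Oliger recursion: one step on a parent level drives r steps
     * on each child level, followed by inter-level restriction. */
    static void advance_level(int level, int max_level, int r) {
        integrate_patches(level);
        exchange_ghosts(level);
        if (level < max_level) {
            for (int i = 0; i < r; i++)
                advance_level(level + 1, max_level, r);
            restrict_to_parent(level + 1, level);
        }
    }

    int main(void) {
        /* 3-level hierarchy with factor-2 time refinement: 1 step on level 0,
         * 2 on level 1, 4 on level 2 -- matching the timing diagram in Fig. 3. */
        advance_level(0, 2, 2);
        return 0;
    }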

The timing diagram (note that this diagram is not to scale) in Fig. 3 illustrates the operation of the SAMR algorithm described above using a 3-level grid hierarchy. For simplicity, only the computation and communication behaviors of processors P1 and P2 are shown. The three components of communication overheads (described in Sect. 3.2) are illustrated in the enlarged portion of the time line. Note that the timing diagram shows that there is one time step on the coarsest level (level 0) of the grid hierarchy followed by two time steps on the first refinement level and four time steps on the second level, before the second time step on level 0 can start. Also note that the computation and communication for each refinement level are interleaved.

The overall efficiency of parallel SAMR applications is limited by the ability to partition the underlying grid hierarchies at runtime to expose all inherent parallelism, minimize communication and synchronization overheads, and balance load. A critical requirement while partitioning these adaptive grid hierarchies is the maintenance of logical locality, both across different levels of the hierarchy under expansion and contraction of the adaptive grid hierarchy structure, and within partitions of grids at all levels when they are decomposed and mapped across processors. The former enables efficient computational access to the grids while the latter minimizes the total communication and synchronization overheads. Furthermore, the grid hierarchy is dynamic, and application adaptation results in application grids being dynamically created, moved, and deleted. This behavior makes it necessary and quite challenging to efficiently re-partition the hierarchy at runtime to both balance load and minimize synchronization overheads.

Fig. 3 Timing diagram for a parallel implementation of the Berger-Oliger SAMR algorithm, showing computation and communication behaviors for two processors


3.2 Communication overheads for parallel SAMR

As shown in Fig. 3, the communication overheads of parallel SAMR applications primarily consist of three components: (1) Inter-level communications are defined between component grids at different levels of the grid hierarchy and consist of prolongations (coarse-to-fine grid data transfer and interpolation) and restrictions (fine-to-coarse grid data transfer and interpolation); (2) Intra-level communications are required to update the grid elements along the boundaries of local portions of a distributed grid and consist of near-neighbor exchanges. These communications can be scheduled so as to be overlapped with computations on the interior region; and (3) Synchronization costs occur when load is not balanced among processors at any time step or at any refinement level. Note that there are additional communication costs due to the data movement required during dynamic load-balancing and redistribution.

Clearly, an optimal partitioning of the SAMR grid hierarchy and scalable SAMR implementations require a careful consideration of the timing pattern and the communication overheads described above. A critical observation from Fig. 3 is that, in addition to balancing the total load assigned to each processor, the load-balance at each refinement level and the communication/synchronization costs within a level need to be addressed. This leads to a trade-off between (i) maintaining parent-child locality for component grids, which results in reduced communication overheads but possibly high load imbalance, and (ii) “orphaning” component grids (i.e., isolating grid elements at different refinement levels of the SAMR hierarchy), resulting in better load balance with slightly higher synchronization costs.

3.3 Driving SAMR application—RM3D kernel

The driving SAMR application used in this paper is the 3-D compressible turbulence application kernel solving the Richtmyer-Meshkov (RM3D) instability. The RM3D application is part of the virtual test facility (VTF) developed at the ASCI/ASAP Center at the California Institute of Technology². The Richtmyer-Meshkov instability is a fingering instability which occurs at a material interface accelerated by a shock wave. This instability plays an important role in studies of supernovae and inertial confinement fusion. The RM3D application is dynamic in nature, exhibits space-time heterogeneity, and is representative of the simulations targeted by this research. A selection of snapshots of the RM3D adaptive SAMR grid hierarchy is shown in Fig. 4.

In particular, RM3D is characterized by highly localized solution features resulting in small patches and deep application grid hierarchies (i.e., a small region is very highly refined in space and time). As a result, the application has increasing computational workloads and greater communication/synchronization requirements at higher refinement levels, with unfavorable computation to communication ratios. The physics modeled by the RM3D application for detonation in a deforming tube

² Center for Simulation of Dynamic Response of Materials—http://www.cacr.caltech.edu/ASAP/


Fig. 4 Snapshots of the grid hierarchy for a 3-D Richtmyer-Meshkov simulation. Note the dynamics of the SAMR grid hierarchy as the application evolves

Table 1 Sample SAMR statistics for the RM3D application on 32, 64, and 128 processors of the IBM SP2 “Blue Horizon” using a 128*32*32 coarse grid, executing for 50 iterations with 3 levels of factor-2 space-time refinements and regridding performed every 4 steps

Application parameter                  32 processors   64 processors   128 processors
AMR efficiency                         87.50%          85.95%          85.29%
Avg. blocks per processor per regrid   15              12              9
Avg. memory per processor (MB)         360.64          202.72          106.08
Minimum block size                     4               4               4
Maximum block size                     128             128             128

is illustrated in Fig. 5. Samples of the SAMR statistics for RM3D execution on 32, 64, and 128 processors of the IBM SP2 “Blue Horizon” are listed in Table 1. The resulting SAMR grid hierarchy characteristics significantly limit RM3D scalability on large numbers of processors. The SAMR runtime engine described in Sects. 4 and 5 addresses these runtime challenges in synchronization, load-balance, locality, and communication to realize scalable RM3D implementations.

4 SAMR hierarchical partitioning framework

The hierarchical partitioning framework within the SAMR runtime engine consists of a stack of partitioners that can manage the space-time heterogeneity and dynamism of the SAMR grid hierarchy. These dynamic partitioning algorithms are based on a


Fig. 5 Richtmyer-Meshkov detonation in a deforming tube modeled using SAMR with 3 levels of refinement. The Z = 0 plane is visualized on the right (Courtesy: R. Samtaney, VTF+GrACE, Caltech)

core Composite Grid Distribution Strategy (CGDS) belonging to the GrACE³ SAMR infrastructure [20]. This domain-based partitioning strategy performs a composite decomposition of the adaptive grid hierarchy using space-filling curves (SFCs) [11, 17, 23, 26]. SFCs are locality-preserving recursive mappings from n-dimensional space to 1-dimensional space. At each regridding stage, the new refinement regions are added to the SAMR domain and the application grid hierarchy is reconstructed. CGDS uses SFCs and partitions the entire SAMR domain into sub-domains such that each sub-domain keeps all its refinement levels as a single composite grid unit. Thus, all inter-level communications are local to a sub-domain and the inter-level communication time is greatly reduced. The resulting composite grid unit list (GUL) for the overall domain must now be partitioned and balanced across processors. However, certain SAMR applications with localized refinements and deep hierarchies (such as RM3D) have substantially higher computational requirements at finer levels of the grid hierarchy. As described in Sect. 3.2, maintaining a GUL with composite grid units may result in high load imbalance across processors in such cases. To address this concern, CGDS allows a composite grid unit with high workload (greater than the load-balancing threshold) to be orphaned/separated into multiple sub-domains, each containing a single level of refinement. An efficient load-balancing scheme within CGDS can use this “orphaning” approach to alleviate processor load imbalances and provide improved application performance despite the increase in inter-level communication costs.
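The paper does not specify which SFC GrACE uses beyond citing [11, 17, 23, 26]; as an illustration of the locality-preserving n-D to 1-D mapping, the following minimal sketch computes a Morton (Z-order) index for a 3-D cell by interleaving coordinate bits (illustrative code, not GrACE's implementation):

    #include <stdint.h>
    #include <stdio.h>

    /* Morton (Z-order) index for a 3-D cell: interleave the bits of x, y, z.
     * Cells that are close in space tend to be close in the 1-D ordering,
     * which is the locality property the composite decomposition relies on. */
    static uint64_t morton3d(uint32_t x, uint32_t y, uint32_t z) {
        uint64_t key = 0;
        for (int b = 0; b < 21; b++) {           /* 21 bits per axis fit in 64 */
            key |= ((uint64_t)(x >> b) & 1) << (3 * b);
            key |= ((uint64_t)(y >> b) & 1) << (3 * b + 1);
            key |= ((uint64_t)(z >> b) & 1) << (3 * b + 2);
        }
        return key;
    }

    int main(void) {
        /* Neighboring cells map to nearby 1-D keys (here 30 and 31). */
        printf("%llu %llu\n",
               (unsigned long long)morton3d(2, 3, 1),
               (unsigned long long)morton3d(3, 3, 1));
        return 0;
    }

Sorting grid units by such a key and cutting the resulting 1-D list into contiguous pieces yields sub-domains that are largely contiguous in space.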

The layered structure of the SAMR hierarchical partitioning framework is shown in Fig. 6. Once the grid hierarchy index space is mapped using the SFC+CGDS scheme, higher-level partitioning techniques can be applied in a hierarchical manner using the hierarchical partitioning algorithm (HPA). In HPA [15], processor groups are defined based on the dynamic grid hierarchy structure and correspond to regions

³ GrACE SPMD data-management framework: http://www.caip.rutgers.edu/TASSL/Projects/GrACE.


Fig. 6 Layered design of the SAMR hierarchical partitioning framework within the SAMR runtime engine. SFC = space-filling curves, CGDS = Composite Grid Distribution Strategy, HPA = hierarchical partitioning algorithm, LPA = level-based partitioning algorithm, GPA = greedy partitioning algorithm, BPA = bin-packing based partitioning algorithm

of the overall computational domain. The top processor group partitions the global GUL obtained initially and assigns portions to each processor sub-group in a hierarchical manner. In this way, HPA further localizes communication to sub-groups, reduces global communication and synchronization costs, and enables concurrent communication. Within each processor sub-group, higher-level partitioning strategies are then applied based on the local requirements of the SAMR grid hierarchy sub-domain. The objective could be to minimize synchronization costs within a level using the level-based partitioning algorithm (LPA), efficiently balance load using the bin-packing based partitioning algorithm (BPA), reduce partitioning costs using the greedy partitioning algorithm (GPA), or combinations of the above. GPA and BPA form the underlying distribution schemes that can work independently or can be augmented using LPA and/or HPA.

4.1 Greedy partitioning algorithm

GrACE uses a default greedy partitioning algorithm to partition the global GUL and produce a local GUL for each processor. The GPA scheme performs a rapid partitioning of the SAMR grid hierarchy as it scans the global GUL only once while attempting to distribute the load equally among all processors, based on a linear assignment of grid units to processors. If the workload of a grid unit exceeds the processor threshold, it is assigned to the next successive processor and the threshold is adjusted. GPA helps in reducing partitioning costs and works quite well for a relatively homogeneous computational domain with few levels of relatively uniform refinement, and for small-to-medium scale application runs. However, due to the greedy nature of the algorithm, GPA tends to result in overloading of the processors encountered near the end of the global GUL, since the load imbalances from previous processors have to be absorbed by these latter processors. Scalable SAMR applications require a good load balance during the computational phase between two regrids of the dynamic grid hierarchy. In applications with localized features and deep grid hierarchies, the load imbalance in GPA at higher levels of refinement can lead to large synchronization delays, thus limiting SAMR scalability.
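A minimal sketch of such a single-scan greedy assignment is shown below (illustrative names and simplified threshold handling; GrACE's actual GPA adjusts the threshold adaptively):

    #include <stdio.h>

    /* Assign grid units to processors in one linear scan of the GUL,
     * spilling to the next processor once the per-processor threshold
     * would be exceeded. Minimal sketch, not the GrACE implementation. */
    void gpa_partition(const double *unit_load, int num_units,
                       int num_procs, int *owner) {
        double total = 0.0;
        for (int i = 0; i < num_units; i++) total += unit_load[i];
        double threshold = total / num_procs;   /* ideal per-processor load */

        int proc = 0;
        double assigned = 0.0;
        for (int i = 0; i < num_units; i++) {
            if (assigned + unit_load[i] > threshold && proc < num_procs - 1) {
                proc++;                          /* spill to next processor */
                assigned = 0.0;
            }
            owner[i] = proc;
            assigned += unit_load[i];
        }
        /* Imbalance accumulates toward the last processors --
         * the overloading effect described in the text. */
    }

    int main(void) {
        double loads[] = {3, 5, 2, 7, 4, 6, 1};
        int owner[7];
        gpa_partition(loads, 7, 3, owner);
        for (int i = 0; i < 7; i++) printf("unit %d -> P%d\n", i, owner[i]);
        return 0;
    }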


Fig. 7 Partitions of a 1-D grid hierarchy for (a) GPA, (b) LPA

4.2 Level-based partitioning algorithm

The computational workload for a given patch of the SAMR application is tightly coupled to the refinement level at which the patch exists: the computational workload at a finer level is considerably greater than that at coarser levels. The level-based partitioning algorithm (LPA) [15] attempts to simultaneously balance load and minimize synchronization cost. LPA essentially preprocesses the global application computational units represented by a global GUL, disassembles them based on their refinement levels, and feeds the resulting homogeneous units at each refinement level to GPA (or any other partitioning/load-balancing scheme). The GPA scheme then partitions this list to balance the workload. Due to the preprocessing, the load on each refinement level is also balanced.

LPA benefits from the SFC-based technique by maintaining parent-child relationships throughout the composite grid and localizing inter-level communications, while simultaneously balancing the load at each refinement level, which reduces the synchronization cost, as demonstrated by the following example. Consider the partitioning of a one-dimensional grid hierarchy with two refinement levels, as illustrated in Fig. 7. For this 1-D example, GPA partitions the composite grid unit list into two sub-domains. These two parts contain exactly the same load: the load assigned to P0 is 2 + 2 × 4 = 10 units, and the load assigned to P1 is also 10 units. From the viewpoint of the GPA scheme, the partitioning result is perfectly balanced, as shown in Fig. 7a. However, due to the heterogeneity of the SAMR algorithm, this distribution leads to large synchronization costs, as shown in the timing diagram (Fig. 8a). The LPA scheme takes these synchronization costs at each refinement level into consideration. For this simple example, LPA produces the partition shown in Fig. 7b, which results in the computation and communication behavior depicted in Fig. 8b. As a result, there is an improvement in overall execution time and a reduction in communication and synchronization time.
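The level-wise preprocessing can be sketched as follows (illustrative structures and names, not the GrACE GUL types): each composite unit is split into one homogeneous unit per refinement level, and the per-level lists are then balanced independently by GPA or BPA.

    #include <stdio.h>

    #define MAX_LEVELS 4
    #define MAX_UNITS  64

    /* A composite grid unit carries a workload for every refinement
     * level it spans; illustrative structure only. */
    typedef struct { double load_at_level[MAX_LEVELS]; } CompositeUnit;

    /* LPA preprocessing: disassemble composite units into homogeneous
     * per-level lists, so each level can be balanced independently. */
    void lpa_disassemble(const CompositeUnit *gul, int num_units, int num_levels,
                         double per_level[MAX_LEVELS][MAX_UNITS],
                         int count[MAX_LEVELS]) {
        for (int l = 0; l < num_levels; l++) count[l] = 0;
        for (int i = 0; i < num_units; i++)
            for (int l = 0; l < num_levels; l++)
                if (gul[i].load_at_level[l] > 0.0)
                    per_level[l][count[l]++] = gul[i].load_at_level[l];
        /* Each per_level[l] list is now fed to GPA/BPA separately,
         * balancing the load at every refinement level. */
    }

    int main(void) {
        CompositeUnit gul[2] = {
            {{2.0, 8.0, 0.0, 0.0}},   /* refined region: level-0 + level-1 load */
            {{10.0, 0.0, 0.0, 0.0}}   /* unrefined region: level-0 load only    */
        };
        double per_level[MAX_LEVELS][MAX_UNITS];
        int count[MAX_LEVELS];
        lpa_disassemble(gul, 2, 2, per_level, count);
        for (int l = 0; l < 2; l++)
            printf("level %d: %d unit(s)\n", l, count[l]);
        return 0;
    }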


Fig. 8 Timing diagrams showing computation and communication behaviors for a 1-D grid hierarchy partitioned using (a) GPA, (b) LPA

4.3 Bin-packing based partitioning algorithm

The bin-packing based partitioning algorithm (BPA) improves the load-balance during the SAMR partitioning phase. The computational workload associated with a GUL at different refinement levels of the SAMR grid hierarchy is distributed among the available processors. The distribution is performed under constraints such as minimum block size (granularity) and aspect ratio. Grid units with loads larger than the threshold limit are decomposed geometrically along each dimension into smaller grid units, as long as the granularity constraint is satisfied. This decomposition can occur recursively for a grid unit if its workload exceeds the processor threshold. If the workload is still high and the orphaning strategy is enabled, the grid units with minimum granularity are separated into multiple uni-level grid elements for better load balance.

Initially, BPA distributes the global GUL workload among processors based on the processor load threshold, in a manner similar to GPA. A grid unit that cannot be allocated to the current processor, even after decomposition and orphaning, is assigned to the next consecutive processor. However, no processor accepts work greater than the threshold in the first phase. Grid units representing unallocated loads after the first phase are distributed among processors using a “best-fit” approach. If no processor matches the load requirements of an unallocated grid unit, the “most-free” approach (i.e., the processor with the least load accepts the unallocated work) is adopted until all the work in the global GUL is assigned.
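A minimal sketch of this second-phase placement is given below (illustrative; the real BPA also performs the recursive geometric decomposition and orphaning described above before falling back to these rules):

    /* Second phase of BPA: place leftover grid units by "best fit"
     * (tightest remaining capacity that still accommodates the unit),
     * falling back to "most free" (least-loaded processor) when nothing
     * fits. Minimal sketch, not the GrACE implementation. */
    void bpa_place_leftovers(const double *unit_load, int num_units,
                             double *proc_load, int num_procs,
                             double threshold, int *owner) {
        for (int i = 0; i < num_units; i++) {
            int best = -1, most_free = 0;
            for (int p = 0; p < num_procs; p++) {
                double slack = threshold - proc_load[p];
                if (unit_load[i] <= slack &&
                    (best < 0 || slack < threshold - proc_load[best]))
                    best = p;                         /* tightest fit  */
                if (proc_load[p] < proc_load[most_free])
                    most_free = p;                    /* least loaded  */
            }
            int target = (best >= 0) ? best : most_free;
            owner[i] = target;
            proc_load[target] += unit_load[i];
        }
    }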

BPA allows the user to set a tolerance value that determines the acceptable workload imbalance for the SAMR application. In the case of BPA, the load imbalance, if any, is low since it is bounded by the tolerance threshold. Due to the underlying bin-packing algorithm, the BPA technique provides overall better load balance for the grid hierarchy partitions among processors as compared to the GPA scheme. However, a large number of patches may be created as a result of multiple patch divisions.


Also, the load distribution strategy in BPA can result in multiple scans of the grid unit list, which marginally increases the partitioning overheads.

A combined approach using LPA and BPA can provide good scalability benefits for SAMR applications, since LPA reduces the synchronization costs and BPA yields good load balance at each refinement level. The experimental evaluation presented in Sect. 6 employs this combined approach: it disassembles the application's global GULs into uniform patches at the various refinement levels using level-based partitioning, and then applies bin-packing based load-balancing to the patches at each refinement level of the SAMR grid hierarchy.

5 Optimizing the SAMR communication substrate

Due to irregular load distributions and communication requirements across different levels of the grid hierarchy, parallel SAMR implementations make extensive use of MPI non-blocking primitives to reduce synchronization overheads. A typical implementation is illustrated in Fig. 9. In this implementation, each processor maintains local lists of all messages to be sent to and received from its neighboring processors. As seen in Fig. 9, each process first posts non-blocking receives (MPI_Irecv) and then packs each message from its send list into a buffer and sends it using MPI_Isend. Following each MPI_Isend, a corresponding MPI_Wait is posted to ensure completion of the send operation so that the corresponding send buffer can be freed. Once all the sends are completed, an MPI_Waitall is posted to check completion of the receives. This exchange is a typical ghost communication associated with parallel finite difference PDE solvers.

{
    /* Initiate all non-blocking receives beforehand */
    LOOP
        MPI_Irecv();
    END LOOP
    for each data operation {
        for all neighboring processors {
            Pack local boundary values
            /* Send boundary values */
            MPI_Isend();
            /* Wait for completion of my sends */
            MPI_Wait();
            Free the corresponding send buffer
        }
        /* Wait for completion of all my receives */
        MPI_Waitall();
        Unpack received data
    }
    ***** COMPUTATION CONTINUES *****
}

Fig. 9 Intra-level communication model for parallel SAMR implementations


Typical message sizes for SAMR applications are on the order of hundreds of kilobytes. However, increasing message sizes can affect the behavior of MPI non-blocking communication, thereby influencing the synchronization latencies and scalability of SAMR applications. Asynchronous MPI behavior is limited by the size of the communication buffers and by the message-passing implementation. As a result, naive use of these operations without an understanding of the underlying implementation can result in serious performance degradation, often producing synchronous behaviors.

As a part of this research, we have investigated the behavior of non-blocking communication primitives provided by popular MPI implementations (such as the MPICH implementation on a Linux Beowulf cluster and the native IBM MPI implementation on an IBM SP2). Since the performance of MPI primitives is sensitive to implementation details, we propose architecture-specific strategies that can be effectively applied within the SAMR communication model to reduce processor synchronization costs [27]. A “staggered sends” approach improves performance for the MPICH implementation on a Beowulf architecture, while the IBM SP2 benefits from a “delayed waits” strategy. The primary objective of this paper is to address and evaluate SAMR scalability on the large numbers of processors that are available on the IBM SP2. Therefore, the rest of this section focuses on the analysis and optimization of MPI non-blocking communication primitives on the IBM SP2. However, we emphasize that non-blocking communication optimizations are applicable to other architectures as well, depending on the MPI semantics.

5.1 Analysis of MPI non-blocking communication semantics on IBM SP2

This section experimentally investigates the effect of message size on MPI non-blocking communication behavior for the IBM MPI version 3 release 2 implementation on the IBM SP2, and determines the thresholds at which the non-blocking calls require synchronization. In this analysis, the sending process (process 0) issues MPI_Isend (IS) at time-step T0 to initiate a non-blocking send operation. At the same time, the receiving process (process 1) posts a matching MPI_Irecv (IR) call. Both processes then execute unrelated computation before executing an MPI_Wait call (denoted by Ws on the send side and Wr on the receive side) at T3 to wait for completion of the communication. The processes synchronize at the start of the experiment and use deterministic offsets to vary the values of T0, T1, T2, and T3 at each process. The values of T0–T3 are approximately the same on the two processes, and the message size is varied for the different evaluations on the IBM SP2. The system buffer size is maintained at the default value.

For smaller message sizes (1 KB), the expected non-blocking MPI semantics are observed. As seen in Fig. 10, IS and IR return immediately. Furthermore, Ws and Wr posted after the local computations also return almost immediately, indicating that the message was delivered during the computation. This behavior also holds for larger message sizes (greater than or equal to 100 KB), i.e., the IBM MPI_Isend implementation continues to return immediately as per the MPI specification, and Ws and Wr take minimal time (on the order of microseconds) to return.


To analyze the effect of increasing message size on the IBM SP2, the MPI_Wait on the send side (Ws) is moved to T1, i.e., directly after the send, to simulate the situation where one might want to reuse the send buffer. The MPI_Wait call on the receive side (Wr) remains at T3. In this case, the non-blocking behavior remains unchanged for small messages. However, for message sizes greater than or equal to 100 KB, it is observed that Ws blocks until Wr is posted by the receiver at T3. This behavior is illustrated in Fig. 11. In an experiment where both processes exchange messages, IS and IR are posted at T0, process 0 posts Ws at T1 while process 1 posts Ws at T2, and both processes post Wr at T3. The message size is maintained at 100 KB. Deadlock is avoided in this case because Ws, posted at T1 and blocking on process 0, returns as soon as process 1 posts Ws at T2, rather than waiting for the corresponding Wr at T3.

The IBM SP2 parallel operating environment (POE) imposes a limit, called the “eager limit”, on the MPI message size that can be sent out asynchronously. This limit is set by the environment variable MP_EAGER_LIMIT and is directly dependent on the size of the memory that MPI uses and the number of processes in the execution environment [19]. For message sizes less than this eager limit, the messages are sent asynchronously. When message sizes exceed the eager limit, the IBM MPI implementation switches to a synchronous mode in which Ws blocks until an acknowledgment arrives from the receiver (i.e., until Wr is posted), as demonstrated in the above experiment. Note that the synchronization call on the receive side need not be a matching wait; in fact, the receiver may post any call to MPI_Wait (or any of its variants) to complete the required synchronization.

5.2 “Delayed Waits” on IBM SP2

The analysis of the IBM MPI implementation demonstrates that the time spent waiting for processes to synchronize, rather than the network latency, is the major source of communication overheads. This problem is particularly significant in applications where communications are not completely synchronized and there is some load imbalance, which is typical in SAMR applications. Increasing the MPI eager limit on the IBM SP2 is not a scalable solution, since increasing the value of this environment variable increases the MPI memory usage per processor, thus reducing the amount of overall memory available to the SAMR application. A more scalable strategy is to address this at the application level by appropriately positioning the IS, IR,

Fig. 10 MPI test profile on IBM SP2: Ws and Wr are posted at the same time-step


Fig. 11 MPI test profile on IBM SP2: Ws and Wr are posted at different time-steps

For n = 1 to number_of_messages_to_receive {
    MPI_IRECV (n, msgid_n)
}
*** COMPUTE ***
For n = 1 to number_of_messages_to_send {
    MPI_ISEND (n, send_msgid_n)
    MPI_WAIT (send_msgid_n)
}
MPI_WAITALL (recv_msgid_*)

Fig. 12 Example of an MPI non-blocking implementation on the IBM SP2

For n = 1 to number_of_messages_to_receive {
    MPI_IRECV (n, msgid_n)
}
**** COMPUTE ****
For n = 1 to number_of_messages_to_send {
    MPI_ISEND (n, send_msgid_n)
}
MPI_WAITALL (recv_msgid_* + send_msgid_*)

Fig. 13 Optimization of MPI non-blocking implementation on IBM SP2 using delayed waits

Ws, and Wr calls. The basic optimization strategy consists of delaying Ws until after Wr, and is shown in Figs. 12 and 13.

To illustrate the strategy, consider a scenario in which two processes exchange a sequence of messages and the execution sequence is split into steps T0–T3. Both processes post MPI_Irecv (IR) calls at T0, and Wall denotes an MPI_Waitall call. Assume that, due to load imbalance, process 0 performs computation until T2 while process 1 computes only until T1. Ws posted on process 1 at T1 will block until process 0 posts Ws at T2. For a large number of messages, this delay can become quite significant. Consequently, to minimize the blocking overhead due to Ws on process 1, it must be moved as close to T2 as possible, corresponding to Ws on process 0. Now, if Ws is removed from the send loop and a collective MPI_Waitall is posted as shown in Fig. 13, it is observed that process 1 posts all of its non-blocking


sends. Also, by the time process 1 reaches T2, it has already posted IS for all of its messages and is waiting on Wall, thus reducing synchronization delays.
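A compact, compilable rendering of the Fig. 13 pattern in MPI C might look as follows (a sketch assuming fixed-size neighbor buffers; this is not the actual GrACE substrate code):

    #include <mpi.h>

    #define NBUF 100000   /* ~100 KB payloads, above the eager limit in the
                             experiments above */

    /* Ghost exchange with "delayed waits": post all receives, then all
     * sends, and complete everything with a single MPI_Waitall at the end
     * instead of waiting on each send individually (cf. Figs. 12 and 13). */
    void ghost_exchange(char sendbuf[][NBUF], char recvbuf[][NBUF],
                        const int *neighbors, int num_neighbors,
                        MPI_Comm comm) {
        MPI_Request reqs[2 * num_neighbors];

        /* Post all non-blocking receives up front. */
        for (int n = 0; n < num_neighbors; n++)
            MPI_Irecv(recvbuf[n], NBUF, MPI_CHAR, neighbors[n], 0, comm,
                      &reqs[n]);

        /* ... interior computation could overlap here ... */

        /* Post all non-blocking sends; no per-send MPI_Wait. */
        for (int n = 0; n < num_neighbors; n++)
            MPI_Isend(sendbuf[n], NBUF, MPI_CHAR, neighbors[n], 0, comm,
                      &reqs[num_neighbors + n]);

        /* Single collective completion of sends and receives; send buffers
         * are only reused or freed after this point. */
        MPI_Waitall(2 * num_neighbors, reqs, MPI_STATUSES_IGNORE);
    }

The design point is that the send buffers are not freed eagerly, so no process blocks on a per-send wait while its peer is still computing.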

6 Scalability evaluation of the SAMR runtime engine

The experimental evaluations of the individual components of the SAMR runtime engine, such as LPA, BPA, and the delayed waits optimization, as well as of the overall scalability for SAMR applications, are performed using the RM3D compressible turbulence application kernel (described in Sect. 3.3) on the NPACI IBM SP2 “Blue Horizon” supercomputer. Blue Horizon is a teraflop-scale, Power3-based clustered Symmetric Multiprocessing (SMP) system from IBM, installed at the San Diego Supercomputing Center (SDSC). The machine contains 1,152 processors running AIX, arranged as 144 SMP compute nodes. The nodes have a total of 576 GB of main memory; each node is equipped with 4 GB of memory shared among its eight 375 MHz Power3 processors. Each node also has several gigabytes of local disk space. Nodes are connected by the Colony switch, a proprietary IBM interconnect.

6.1 Evaluation: LPA and BPA techniques

The LPA and BPA partitioning/load-balancing techniques are evaluated for the RM3D application on 64 processors of Blue Horizon using the GrACE infrastructure. RM3D uses a base grid of 128*32*32 and 3 levels of factor-2 space-time refinements, with regridding performed every 8 time-steps at each level. The RM3D application executes for 100 iterations, and the total execution time, synchronization (Sync) time, recompose time, average maximum load imbalance, and the number of boxes are measured for each of the following three configurations: (i) the default GrACE partitioner (GPA), (ii) the LPA scheme using GrACE, and (iii) the LPA + BPA technique using GrACE, i.e., the combined approach.

Figure 14 illustrates the effect of the LPA and BPA partitioning schemes on RM3D application performance. Note that the values plotted in the figure are normalized against the corresponding maximum value. The results demonstrate that the LPA scheme helps to reduce application synchronization time while the BPA technique provides better load balance. A combined approach reduces the overall execution time by around 10% and results in improved application performance. The

Fig. 14 Evaluation of RM3D performance (normalized values) for LPA and BPA partitioning schemes on 64 processors of Blue Horizon


LPA + BPA strategy improves load balance by approximately 70% as compared to the default GPA scheme. The improvements in load balance and synchronization time outweigh the overheads (an increase in the number of boxes) incurred while performing the optimizations within LPA and BPA. This evaluation is performed at a reduced scale on fewer processors, and hence the performance benefits of the SAMR partitioning framework are lower. However, these hierarchical partitioning strategies become critical for scalable SAMR implementations, as shown later in this paper.

6.2 Evaluation: delayed waits optimization

The evaluation for the delayed waits optimization consists of measuring the message passing and application execution times for the RM3D application with and without the optimization in the SAMR communication substrate. Except for the MPI non-blocking optimization, all other application-specific and refinement-specific parameters are kept constant. RM3D uses a base grid size of 256*64*64 and 3 levels of factor-2 space-time refinements, with regridding performed every 8 time-steps at each level, and the application executes for 100 iterations. Figures 15 and 16 show the comparisons of the execution times and communication times for the two configurations on 64, 128, and 256 processors of Blue Horizon, respectively. The delayed waits optimization helps to reduce overall execution time, primarily due to the decrease in message passing time. In this evaluation, the reduction in communication time is 44.37% on average and results in improved application performance.

Fig. 15 RM3D performance improvement in total execution time using the delayed waits optimization for MPI non-blocking communication primitives on the IBM SP2

Fig. 16 Comparison of RM3D message passing performance using the delayed waits optimization for MPI non-blocking communication primitives on the IBM SP2


6.3 Evaluation: overall RM3D scalability

The evaluation of overall RM3D SAMR scalability uses a base coarse grid of size 128*32*32, and the application executes for 1000 coarse-level time-steps. The experiments are performed on 256, 512, and 1024 processors of Blue Horizon using 4 levels of factor-2 space-time refinements, with regridding performed every 64 time-steps at each level. The partitioning algorithm chosen from the hierarchical SAMR partitioning framework for these large-scale evaluations is LPA + BPA, since the combination of these two partitioners results in reduced synchronization costs and better load balance, as described in Sects. 4 and 6.1. The orphaning strategy and a local refinement approach are used in conjunction with LPA + BPA in this evaluation, since the RM3D application exhibits localized patterns and deep hierarchies with large computational and communication requirements. Furthermore, the delayed waits strategy that optimizes MPI non-blocking communication is employed for these RM3D scalability tests to further reduce messaging-level synchronization delays, as discussed in Sect. 6.2. The minimum block size for a patch on the grid is maintained at 16.

The scalability tests on 256, 512, and 1024 processors measure the runtime metrics for the RM3D application. For these three experiments, the overall execution time, average computation time, average synchronization time, and average regridding/redistribution time are illustrated in Figs. 17, 18, 19, and 20, respectively. The vertical error bars in Figs. 18, 19, and 20 represent the standard deviations of the corresponding metrics.

Fig. 17 Scalability evaluation of the overall execution time for 1000 coarse-level iterations of the RM3D SAMR application on a 128*32*32 base grid with 4 levels of refinement—log graphs for 256, 512, and 1024 processors of Blue Horizon

Fig. 18 Scalability evaluation of the average computation time for 1000 coarse-level iterations of the RM3D SAMR application on a 128*32*32 base grid with 4 levels of refinement—log graphs for 256, 512, and 1024 processors of Blue Horizon. The vertical error bars are standard deviations of the computation times


Table 2 Scalability evaluation for the RM3D application on 256, 512, and 1024 processors of Blue Horizon in terms of coefficients of variation (CV) and additional metrics. The RM3D evaluation is performed for 1000 coarse-level iterations on a 128*32*32 base grid with 4 levels of refinement

Application parameter   256 processors   512 processors   1024 processors
Computation CV          11.31%           15.5%            12.91%
Synchronization CV      8.48%            6.62%            7.41%
Regridding CV           2.89%            7.52%            9.1%
Avg. time per sync      0.43 sec         0.38 sec         0.22 sec
Avg. time per regrid    5.35 sec         7.74 sec         9.46 sec

Table 2 presents the coefficients of variation⁴ (CV) for the computation, synchronization, and regridding times, and the average time per synchronization and regrid operation.

As seen in Fig. 17, the scalability ratio of the overall application execution time from 256 to 512 processors is 1.394 (the ideal scalability ratio is 2), yielding a parallel efficiency of 69.72%. The corresponding values of the scalability ratio and parallel efficiency as processors increase from 512 to 1024 are 1.661 and 83.05%, respectively. These execution times and parallel efficiencies indicate reasonably good scalability, considering the high computation and communication runtime requirements of the RM3D application. The LPA partitioning, bin-packing based load-balancing, and MPI non-blocking communication optimization techniques collectively enable scalable RM3D runs with multiple refinement levels on large numbers of processors.
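In our notation (the paper does not spell this out), the relative parallel efficiency when doubling the processor count follows directly from the measured speedup:

    E_{p \to 2p} = \frac{T_p / T_{2p}}{2}, \qquad
    E_{256 \to 512} = \frac{1.394}{2} \approx 69.7\,\%, \qquad
    E_{512 \to 1024} = \frac{1.661}{2} = 83.05\,\%.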

The RM3D average computation time, shown in Fig. 18, scales quite well. The computation CV in Table 2 is reasonably low for the entire evaluation, implying good overall application load-balance provided by the SAMR runtime engine (the LPA + BPA strategy). Note that, in the case of large numbers of processors, the RM3D application has a few phases in which there is not enough workload in the domain to be distributed among all processors. In such cases, some processors remain idle during the computation phase, which affects the standard deviations of the computation times. Hence, the computation CVs for 512 and 1024 processors are slightly higher than for 256 processors, as shown in Table 2. However, this lack of computation on large numbers of processors is an intrinsic characteristic of the application and not a limitation of the SAMR runtime engine.

The scalability of the synchronization time is limited by the highly communication-dominated application behavior and unfavorable computation to communication ratios, as described in Sect. 3.3 and observed in Fig. 19. The evaluation exhibits reasonably low synchronization CV in Table 2, which can be attributed to the communication improvements provided by the SAMR runtime engine (the LPA and delayed waits techniques). As observed in Table 2, the average time taken per synchronization operation reduces with an increase in the number of processors. This is primarily due to the reduction in size of the grid units owned by processors, resulting in smaller message transfers for boundary updates. The decrease in synchronization time is proportionately greater when processors increase from 512 to 1024 due to smaller

4 The coefficient of variation is a dimensionless number, defined as the ratio of the standard deviation to the mean for a particular metric, and is usually expressed as a percentage.
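In symbols, the footnote's definition reads

\[ \mathrm{CV} = \frac{\sigma}{\mu} \times 100\%, \]

where \(\sigma\) is the standard deviation and \(\mu\) the mean of the metric in question (for Table 2, the computation, synchronization, and regridding times).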


Fig. 19 Scalability evaluation of the average synchronization time for 1000 coarse-level iterations of the RM3D SAMR application on a 128*32*32 base grid with 4 levels of refinement—log graphs for 256, 512, and 1024 processors of Blue Horizon. The vertical error bars are standard deviations of the synchronization times

Fig. 20 Scalability evaluation of the average regridding time for 1000 coarse-level iterations of the RM3D SAMR application on a 128*32*32 base grid with 4 levels of refinement—log graphs for 256, 512, and 1024 processors of Blue Horizon. The vertical error bars are standard deviations of the regridding times

Moreover, as described earlier, some application phases may not have sufficient workload to allocate to all processors during redistribution on large numbers of processors, forcing some processors to remain idle. In such cases, synchronization times can be reduced since the idle processors do not participate in the synchronization process. However, this is not a limitation of the SAMR runtime engine, but a direct consequence of the application's limited dynamics (insufficient refinement activity) in those phases.
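The delayed-waits idea referenced above can be illustrated with a minimal ghost-cell exchange (C++ with the MPI C API). This is a hedged sketch of the general pattern (post non-blocking receives and sends, overlap them with interior computation, and block only when boundary data is actually consumed), not the paper's actual substrate code; the buffer layout, neighbor list, and compute stubs are illustrative.

#include <mpi.h>
#include <cstddef>
#include <vector>

static void compute_interior() { /* stencil updates needing no ghost data */ }
static void compute_boundary() { /* stencil updates consuming ghost data  */ }

// Each neighbor owns one 'count'-sized slot in send_buf/recv_buf (packed
// face data), so no buffer is shared between in-flight operations.
void ghost_exchange(double* send_buf, double* recv_buf, int count,
                    const std::vector<int>& neighbors, MPI_Comm comm) {
  std::vector<MPI_Request> reqs(2 * neighbors.size());

  for (std::size_t i = 0; i < neighbors.size(); ++i)   // post receives first
    MPI_Irecv(recv_buf + i * count, count, MPI_DOUBLE,
              neighbors[i], 0, comm, &reqs[i]);
  for (std::size_t i = 0; i < neighbors.size(); ++i)   // then the sends
    MPI_Isend(send_buf + i * count, count, MPI_DOUBLE,
              neighbors[i], 0, comm, &reqs[neighbors.size() + i]);

  compute_interior();  // overlap: this work does not depend on the messages

  // Delayed wait: block only here, immediately before ghost data is consumed.
  MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);

  compute_boundary();
}

Moving the MPI_Waitall past compute_interior, rather than issuing it eagerly after the sends, is what distinguishes the delayed-waits pattern and is where the overlap of communication and computation comes from.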

The scalability evaluation of the regridding costs for the RM3D application is plotted in Fig. 20. SAMR regridding/redistribution entails error estimation, clustering and refinement of application sub-domains, global domain decomposition/partitioning, and reconstruction of the application grid hierarchy, followed by data migration among processors to reflect the new distribution. The average regridding time increases from 256 to 1024 processors since it is more expensive to perform global decomposition operations for larger numbers of processors. However, the partitioning overheads induced by the SAMR hierarchical partitioning framework are minuscule compared to the average regridding costs. As an example, the overall partitioning overhead for the SAMR runtime engine on 512 processors is 4.69 seconds, which is negligible compared to the average regridding time of 364.01 seconds. The regridding CV and the average time per regrid, noted in Table 2, show a similar trend to the average regridding time: these metrics increase in value for larger numbers of processors.


To avoid prohibitive regridding costs, the redistribution of the SAMR grid hierarchy is performed less frequently, at intervals of 64 time-steps on each level.
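A skeleton of this regridding sequence, with the 64-step interval just mentioned applied per level, might look as follows. All routine names here are hypothetical placeholders standing in for the framework's actual phases; this is a sketch under that assumption, not a published API.

// Placeholder stubs for the five regridding phases described above.
static void flag_cells_by_error_estimate(int /*level*/) {}  // 1. error estimation
static void cluster_flagged_cells(int /*level*/)        {}  // 2. clustering/refinement of sub-domains
static void partition_global_domain()                   {}  // 3. global decomposition/partitioning
static void rebuild_grid_hierarchy()                    {}  // 4. reconstruct the SAMR grid hierarchy
static void migrate_data_to_new_owners()                {}  // 5. data migration to the new distribution

const int kRegridInterval = 64;  // time-steps between regrids on each level

void maybe_regrid(int level, int step_on_level) {
  if (step_on_level % kRegridInterval != 0) return;  // regrid only every 64 steps
  flag_cells_by_error_estimate(level);
  cluster_flagged_cells(level);
  partition_global_domain();
  rebuild_grid_hierarchy();
  migrate_data_to_new_owners();
}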

This scalability evaluation on 256–1024 processors analyzed the runtime behavior and overall performance of the RM3D application with a 128*32*32 base grid and 4 levels of refinement. Note that a unigrid RM3D implementation for the same experimental settings would use a domain of size 1024*256*256 with approximately 67 million computational grid points. Such a unigrid implementation would require extremely large overall memory (a total of 600 GB) that accounts for the spatial resolution (1024*256*256), the grid data representation (8 bytes for "double" data), local copies in memory for grid functions and data structures (approximately 50), storage for previous, current, and next time states (3 sets in total), and temporal refinement corresponding to the number of time-steps at the finest level (a factor of 8). With 0.5 GB of memory available per processor on Blue Horizon, this unigrid implementation would require 1200 processors to complete execution, which exceeds the system configuration of 1152 processors. Thus, unigrid implementations are not even feasible for large-scale execution of the RM3D application on the IBM SP2. Moreover, even if such unigrid RM3D implementations were possible on large systems, they would entail a tremendous waste of computational resources since the RM3D application exhibits localized refinement patterns with high SAMR efficiencies. Consequently, a scalable SAMR solution is the only viable alternative for realizing such large-scale implementations, which underlines the primary motivation for this research.
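The 600 GB figure can be checked directly from the stated factors:

\[ \underbrace{1024 \times 256 \times 256}_{\approx 67 \times 10^{6}\ \text{points}} \times \underbrace{8\ \text{B}}_{\text{double}} = 0.5\ \text{GB}, \qquad 0.5\ \text{GB} \times \underbrace{50}_{\text{copies}} \times \underbrace{3}_{\text{time states}} \times \underbrace{8}_{\text{temporal refinement}} = 600\ \text{GB}, \]

and 600 GB at 0.5 GB per processor implies 600/0.5 = 1200 processors, exceeding the 1152 available on Blue Horizon.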

6.4 Evaluation: SAMR benefits for RM3D performance

To further evaluate the benefits of using SAMR, another experiment compares the RM3D application performance for configurations with identical finest-level resolution but different coarse/base grid sizes and numbers of refinement levels. This evaluation is performed on 512 processors of Blue Horizon with all other application-specific and refinement-specific parameters kept constant. Note that for every step on the coarse level (level 0), two steps are taken at the first refinement level (level 1), four steps on level 2, and so on. Each configuration has the same finest resolution and executes for the same number of time-steps (in this case, 8000) at the finest level of the SAMR grid hierarchy. The RM3D domain size at the finest level is 1024*256*256, and the evaluation uses the LPA + BPA technique with a minimum block size of 16.

Table 3 RM3D performance evaluation of SAMR on 512 processors of Blue Horizon for different application base grids and varying refinement levels. Both experiments have the same finest resolution and execute for 8000 steps at the finest level

Application parameter       4-Level Run             5-Level Run            Performance
                            (128*32*32 base grid)   (64*16*16 base grid)   improvement
Overall execution time      10805 sec               6434.62 sec            40.45%
Avg. computation time       2912.51 sec             1710.15 sec            41.28%
Avg. synchronization time   2823.05 sec             1677.18 sec            40.59%
Avg. regridding time        364.01 sec              255.34 sec             29.85%


Table 4 Coefficients of variation (CV) and additional metrics for evaluating RM3D SAMR performance on 512 processors of Blue Horizon using 4 and 5 refinement levels

Application parameter    4-Level Run (128*32*32 base)    5-Level Run (64*16*16 base)
Computation CV           15.5%                           14.27%
Synchronization CV       6.62%                           7.67%
Regridding CV            7.52%                           4.39%
Avg. time per sync       0.38 sec                        0.32 sec
Avg. time per regrid     7.74 sec                        7.74 sec

The first configuration (4-Level Run) uses a coarse grid of size 128*32*32 with 4 levels of factor-2 space-time refinement, and executes for 1000 coarse-level iterations, which correspond to 8000 steps at the finest level. The second configuration (5-Level Run) uses a 64*16*16 base grid with 5 refinement levels and runs for 500 coarse-level iterations, corresponding to 8000 steps at the finest level, to achieve the same resolution.
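With a refinement factor r = 2 and L levels (levels 0 through L-1), each coarse iteration advances the finest level by r^{L-1} steps, so both configurations indeed reach the same finest-level step count:

\[ N_{\text{finest}} = N_{\text{coarse}} \cdot r^{\,L-1}: \qquad 1000 \cdot 2^{3} = 500 \cdot 2^{4} = 8000. \]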

Table 3 presents the runtime metrics for the 4-Level and 5-Level configurations and the performance improvement obtained with SAMR. Table 4 lists the coefficients of variation for the computation, synchronization, and regridding times for the two configurations, and the average time per synchronization and regrid operation. In this evaluation, the error estimator used for determining refinement at each level of the hierarchy is an absolute threshold of 0.005. Note that the quality of the solutions obtained in the two configurations is comparable; however, the grid hierarchies are not identical. It is typically not possible to guarantee identical grid hierarchies for different application base grids with varying refinement levels if the application uses localized refinements and performs runtime adaptations on the SAMR grid hierarchy. However, we reiterate that the two grid hierarchies in this evaluation are comparable, which is validated by the similar average time per regrid (shown in Table 4) for the 4-Level and 5-Level runs.
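As an illustration of such an absolute-threshold estimator, a cell might be flagged for refinement whenever its local error estimate exceeds the threshold. The loop below is a hypothetical sketch; the error estimate itself is application-supplied, and the names are illustrative rather than the application's actual estimator.

#include <cmath>
#include <cstddef>
#include <vector>

// Flag cell i for refinement when its local error estimate exceeds the
// absolute threshold (0.005 in this evaluation).
void flag_cells(const std::vector<double>& error, std::vector<bool>& refine,
                double threshold = 0.005) {
  refine.assign(error.size(), false);
  for (std::size_t i = 0; i < error.size(); ++i)
    refine[i] = std::fabs(error[i]) > threshold;
}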

SAMR techniques seek to improve the accuracy of the solution by dynamically refining the computational grid only in the regions with large local solution error. The 5-Level Run has fewer grid points in the application grid hierarchy at the finest resolution since the refinements are localized, and hence uses fewer resources than the corresponding 4-Level Run. Consequently, the average times for all runtime metrics shown in Table 3 and the CV values for computation and regridding (in Table 4) are lower for the 5-Level Run. Though the average time for each synchronization operation is lower for the 5-Level Run (as seen in Table 4), the standard deviation values per synchronization operation are similar for both configurations. Consequently, the 5-Level Run has a relatively higher synchronization CV than the 4-Level Run due to deeper SAMR hierarchies requiring more inter-level communication. As observed from Table 3, the overall RM3D execution time shows around a 40% improvement for the 5-Level hierarchy due to the efficiency of the basic Berger-Oliger SAMR algorithm. These results corroborate our claim that the RM3D application exhibits localized refinement patterns with high SAMR efficiencies, and can derive substantial performance benefits from scalable SAMR implementations.


7 Conclusion

This paper presented a SAMR runtime engine that addresses the scalability of SAMR applications with localized refinements and high SAMR efficiencies on large numbers of processors (up to 1024 processors). The SAMR runtime engine consists of two components: (1) a hierarchical partitioning/load-balancing framework that addresses the space-time heterogeneity and dynamism of the SAMR grid hierarchy, and (2) a SAMR communication substrate that optimizes the use of MPI non-blocking communication primitives. The hierarchical partitioning approach reduces global synchronization costs by localizing communication to processor sub-groups, and enables concurrent communication. The level-based partitioning scheme maintains locality and reduces synchronization overheads, while the bin-packing based load-balancing technique balances the computational workload at each refinement level. The underlying MPI non-blocking communication optimization uses a delayed-waits strategy and helps reduce overall communication costs. The SAMR runtime engine, as well as its individual components, was experimentally evaluated on the IBM SP2 supercomputer using a 3-D Richtmyer-Meshkov (RM3D) compressible turbulence kernel. The results demonstrated that the proposed techniques improve the performance of applications with localized refinements and high SAMR efficiencies, and enable scalable SAMR implementations that have hitherto not been feasible.

Acknowledgements This work was supported in part by the National Science Foundation via grant numbers ACI 9984357 (CAREERS), EIA 0103674 (NGS) and EIA 0120934 (ITR), and by DOE ASCI/ASAP (Caltech) via grant numbers PC295251 and 1052856 awarded to Manish Parashar. The authors would like to thank Johan Steensland for collaboration and valuable research discussions, and Ravi Samtaney and Jaideep Ray for making their applications available for use.

References

1. ASCI Alliance, http://www.llnl.gov/asci-alliances/asci-chicago.html, University of Chicago
2. ASCI/ASAP Center, http://www.cacr.caltech.edu/ASAP, California Institute of Technology
3. Bell J, Berger M, Saltzman J, Welcome M (1994) Three-dimensional adaptive mesh refinement for hyperbolic conservation laws. SIAM J Sci Comput 15(1):127–138
4. Berger M, Hedstrom G, Oliger J, Rodrigue G (1983) Adaptive mesh refinement for 1-dimensional gas dynamics. In: IMACS Trans Sci Comput. IMACS/North Holland, pp 43–47
5. Berger M, Oliger J (1984) Adaptive mesh refinement for hyperbolic partial differential equations. J Comput Phys 53(March):484–512
6. Bryan G (1999) Fluids in the universe: adaptive mesh refinement in cosmology. Comput Sci Eng (March–April):46–53
7. Chandra S, Sinha S, Parashar M, Zhang Y, Yang J, Hariri S (2002) Adaptive runtime management of SAMR applications. In: Sahni S, Prasanna V, Shukla U (eds) Proceedings of the 9th international conference on high performance computing (HiPC'02). Lecture notes in computer science, vol 2552. Springer, Bangalore, India, December 2002, pp 564–574
8. CHOMBO, http://seesar.lbl.gov/anag/chombo/, NERSC, ANAG of Lawrence Berkeley National Lab, CA, USA, 2003
9. Choptuik M (1989) Experiences with an adaptive mesh refinement algorithm in numerical relativity. In: Evans C, Finn L, Hobill D (eds) Frontiers in numerical relativity. Cambridge University Press, London, pp 206–221
10. Hawley S, Choptuik M (2000) Boson stars driven to the brink of black hole formation. Phys Rev D 62:104024
11. Hilbert D (1891) Über die stetige Abbildung einer Linie auf ein Flächenstück. Math Ann 38:459–460
12. Kohn S, SAMRAI Homepage, Structured adaptive mesh refinement applications infrastructure. http://www.llnl.gov/CASC/SAMRAI/, 1999
13. Lan Z, Taylor V, Bryan G (2001) Dynamic load balancing for structured adaptive mesh refinement applications. In: Proceedings of international conference on parallel processing, Valencia, Spain, 2001, pp 571–579
14. Lan Z, Taylor V, Bryan G (2001) Dynamic load balancing of SAMR applications on distributed systems. In: Proceedings of supercomputing conference (SC'01), Denver, CO, 2001
15. Li X, Parashar M (2003) Dynamic load partitioning strategies for managing data of space and time heterogeneity in parallel SAMR applications. In: Kosch H, Boszormenyi L, Hellwagner H (eds) Proceedings of the 9th international Euro-Par conference (Euro-Par'03). Lecture notes in computer science, vol 2790. Springer, Klagenfurt, Austria, August 2003, pp 181–188
16. MacNeice P, Olson K, Mobarry C, de Fainchtein R, Packer C (2000) PARAMESH: a parallel adaptive mesh refinement community toolkit. Comput Phys Commun 126:330–354
17. Moon B, Jagadish H, Faloutsos C, Saltz J (2001) Analysis of the clustering properties of the Hilbert space-filling curve. IEEE Trans Knowl Data Eng 13(1):124–141
18. Norman M, Bryan G (1998) Cosmological adaptive mesh refinement. In: Miyama S, Tomisaka K (eds) Numerical astrophysics. Kluwer, Tokyo
19. Parallel Environment (PE) for AIX V3R2.0, MPI programming guide, 2nd edn, December 2001
20. Parashar M, Browne J (1996) On partitioning dynamic adaptive grid hierarchies. In: Proceedings of the 29th annual Hawaii international conference on system sciences, January 1996, pp 604–613
21. Parashar M, Browne J (1995) Distributed dynamic data structures for parallel adaptive mesh refinement. In: Proceedings of international conference for high performance computing, December 1995, pp 22–27
22. Parashar M, Wheeler J, Pope G, Wang K, Wang P (1997) A new generation EOS compositional reservoir simulator: Part II—framework and multiprocessing. In: Society of petroleum engineering reservoir simulation symposium, Dallas, TX, June 1997
23. Peano G (1890) Sur une courbe qui remplit toute une aire plane. Math Ann 36:157–160
24. Pember R, Bell J, Colella P, Crutchfield W, Welcome M (1993) Adaptive Cartesian grid methods for representing geometry in inviscid compressible flow. In: 11th AIAA computational fluid dynamics conference, Orlando, FL, July 1993
25. Ray J, Najm H, Milne R, Devine K, Kempka S (2000) Triple flame structure and dynamics at the stabilization point of an unsteady lifted jet diffusion flame. Proc Combust Inst 28(1):219–226
26. Sagan H (1994) Space-filling curves. Springer
27. Saif T (2004) Architecture specific communication optimizations for structured adaptive mesh refinement applications. M.S. thesis, Graduate School, Rutgers University
28. Steensland J, Chandra S, Parashar M (2002) An application-centric characterization of domain-based SFC partitioners for parallel SAMR. IEEE Trans Parallel Distribut Syst 13(12):1275–1289
29. Steinthorsson E, Modiano D (1995) Advanced methodology for simulation of complex flows using structured grid systems. In: Surface modeling, grid generation, and related issues in CFD solutions, NASA Conference Publication 3291, May 1995
30. Wang P, Yotov I, Arbogast T, Dawson C, Parashar M, Sepehrnoori K (1997) A new generation EOS compositional reservoir simulator: Part I—formulation and discretization. In: Society of petroleum engineering reservoir simulation symposium, Dallas, TX, June 1997
31. Wissink A, Hornung R, Kohn S, Smith S, Elliott N (2001) Large scale parallel structured AMR calculations using the SAMRAI framework. In: Proceedings of supercomputing conference (SC'01), Denver, CO, 2001

Sumir Chandra is a Ph.D. student in the Department of Electrical and Computer Engineering and a researcher at the Center for Advanced Information Processing at Rutgers University in Piscataway, NJ, USA. He has also been a visiting researcher at Sandia National Laboratories in Livermore, California. He received a B.E. degree in Electronics Engineering from Bombay University, India, and an M.S. degree in Computer Engineering from Rutgers University. He is a member of the IEEE, IEEE Computer Society, and Society for Modeling and Simulation International. His research interests include computational science, software engineering, performance optimization, modeling and simulation.

Xiaolin Li is Assistant Professor of Computer Science at Oklahoma State University. He received a B.E. degree from Qingdao University, China, an M.E. degree from Zhejiang University, China, and Ph.D. degrees from the National University of Singapore and Rutgers University, USA. His research interests include distributed systems, sensor networks, and software engineering. He directs the Scalable Software Systems Laboratory (http://www.cs.okstate.edu/~xiaolin/S3Lab). He is a member of the IEEE, a member of the executive committee of the IEEE Computer Society Technical Committee on Scalable Computing (TCSC), and a member of the ACM.

Taher Saif is a Senior Java Consultant at The A Consulting Team Inc., NJ. He has been consulting with TD Ameritrade, a brokerage firm based in NJ, for the past 3 years. Prior to his current position, he completed his Masters in Computer Engineering at Rutgers University, where he was a researcher at the Center for Advanced Information Processing. His research interests include parallel/distributed and autonomic computing, software engineering, and performance optimization.

Manish Parashar is Professor of Electrical and Computer Engineering at Rutgers University, where he is also co-director of the Center for Advanced Information Processing and director of the Applied Software Systems Laboratory. He received a B.E. degree in Electronics and Telecommunications from Bombay University, India, and M.S. and Ph.D. degrees in Computer Engineering from Syracuse University. He has received the Rutgers Board of Trustees Award for Excellence in Research (2004–2005), the NSF CAREER Award (1999), and the Enrico Fermi Scholarship from Argonne National Laboratory (1996). His research interests include autonomic computing, parallel and distributed computing, scientific computing, and software engineering.