Semi-Static and Dynamic Load Balancing for Asynchronous Hurricane Storm Surge Simulations

Maximilian H Bremer
University of Texas, Austin, TX

Lawrence Berkeley National Laboratory, Berkeley, CA

[email protected]

John D Bachan
Lawrence Berkeley National Laboratory, Berkeley, CA
[email protected]

Cy P Chan
Lawrence Berkeley National Laboratory, Berkeley, CA
[email protected]

ABSTRACT

The performance of hurricane storm surge simulations is critical to forecast and mitigate the deadly effects of hurricane landfall. Supercomputers play a key role in running these simulations quickly; however, disruptive changes in future computer architectures will require adapting simulators to maintain high performance, such as by increasing asynchrony and improving load balance. We introduce two new multi-constraint, fully asynchronous load balancers and a new discrete-event simulator (DGSim) that is able to natively model the execution of task-based hurricane simulations based on efficient one-sided, active message-based communication protocols. We calibrate and validate DGSim and use it to compare the algorithms' load balancing capabilities and task migration costs under many parameterizations, saving over 5,000x in core-hours compared to running the application code directly. Our load balancing algorithms achieve a performance improvement of up to 56 percent over the original static balancer and up to 97 percent of the optimal speed-up.

ACM Reference format:
Maximilian H Bremer, John D Bachan, and Cy P Chan. 2018. Semi-Static and Dynamic Load Balancing for Asynchronous Hurricane Storm Surge Simulations. In Proceedings of Parallel Applications Workshop, Alternatives to MPI, Dallas, TX, USA, November 2018 (PAW-ATM'18), 11 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Since 1980, seven out of the ten most costly US climate disasters were hurricanes, with Hurricane Katrina being the most expensive [?]. Hurricanes Harvey, Maria, and Irma, which occurred in 2017, are expected to be among the five most costly US natural disasters. The utilization of computational models can provide local officials with high-fidelity tools to assist in evacuation efforts and mitigate loss of life and property. Due to the highly nonlinear nature of hurricane dynamics and stringent time constraints, high performance computing (HPC) plays a cornerstone role in providing accurate predictions of flooding. Because of the importance of fast, efficient models, there is significant interest in improving the speed and quality of these computational tools.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
PAW-ATM'18, Dallas, TX, USA
© 2018 ACM. 978-x-xxxx-xxxx-x/YY/MM...$15.00
DOI: 10.1145/nnnnnnn.nnnnnnn

Even as the speed of supercomputers is drastically increasing, the end of Moore's law and the introduction of many-core architectures represent a tectonic shift in the HPC community. In particular, the degree of hardware parallelism is increasing at an exponential rate, and the cost of data movement and synchronization is increasing faster than the cost of computation [?]. As a result, asynchronous task-based programming has been of great interest due to its potential to express increased software parallelism, introduce more flexible load balancing capabilities, and hide the cost of communication through task over-decomposition [? ? ? ? ? ? ? ?]. These programming models decouple the specification of the algorithm from the task scheduling mechanism, which determines where and when each task may execute and orchestrates the movement of required data between the tasks. Furthermore, lightweight, one-sided messaging protocols that support active messages have the potential to reduce the overheads associated with inter-process communication and synchronization, which will become even more important as parallelism increases. However, transitioning scientific applications from their current synchronous implementations to asynchronous, task-based programming models requires a very high software engineering effort. Additionally, load balancing the execution of the resulting task graph remains a challenge due to the underlying computational hardness of optimal task scheduling.

This paper examines the potential benefits of implementing DGSWEM, a discontinuous Galerkin (DG) finite-element storm surge code, on a task-based asynchronous execution model, and investigates various load balancing strategies for the resulting program. One of the key aspects of DGSWEM is its ability to simulate coastal inundation during hurricane landfall. The DG kernel can be implemented in a manner such that dry regions of the simulation require no computational work; however, as the hurricane inundates the coast, this optimization introduces significant dynamic load imbalance. In order to address the load imbalance while still fitting the problem in machine memory, two constraints (load and memory) must be accounted for simultaneously.

While multi-constraint graph partitioning tools have been used to obtain good static partitions for these scenarios, they perform sub-optimally for irregular applications such as ours. On the other hand, fully dynamic load balancing strategies are typically based on balancing a single constraint, which is insufficient for hurricane simulation. In order to overcome these shortcomings, we investigate dynamic and semi-static multi-constraint load balancing approaches that can be leveraged to run hurricane simulations efficiently on future supercomputing platforms.

To circumvent the software engineering effort of porting DGSWEM to a task-based model and to allow extremely rapid evaluation of the resulting execution under a variety of parameters (e.g. choice of load balancing algorithm), we have developed DGSim, a task-based execution model and discrete-event simulation tool that features asynchronously migratable, globally addressable objects. DGSim was designed to natively model lightweight, one-sided, active interprocess messaging along with distributed, asynchronously migratable objects. Together, these allow the simulation to fully overlap both the computation of the new tile placements and the resulting movement of the tile data with the application's main computation. These features are not readily available in other simulation tools but are key features of the fully asynchronous load balancing algorithms introduced here. The development of DGSim allows us to estimate the performance of hurricane simulations running with these advanced features in a variety of configurations. Crucially, DGSim allows us to simulate the computational behavior of DGSWEM in an asynchronous task-based runtime using skeletonization, eliminating the need to execute the costly computational kernels and resulting in a lightweight approach that allows for rapid algorithm prototyping and evaluation. The main contributions of this paper are:

(1) The development of multi-constraint dynamic load balancing strategies specifically geared toward the irregularity associated with the simulation of hurricane storm surge. In doing so, we present a dynamic and a semi-static algorithm.

(2) The trellis approach for semi-static repartitioning strategies, which fully leverages the asynchronous nature of the run-time. This approach can be extended to ensure efficient semi-static load balancing for a wide range of problems.

(3) The development and validation of DGSim, a discrete-event simulator that enables rapid prototyping and evaluation of fully asynchronous load balancers for task-based parallel programs.

The outline of the paper is as follows. Section 2 presents related work. In Section 3, we discuss the DG kernel and the irregular nature of the inundation problem. Thereafter, we outline DGSim, which simulates the performance of an asynchronous task-based implementation of DGSWEM. Section 6 presents formalism for load balancing in an asynchronous context and outlines a distributed diffusion-based and a semi-static load balancing algorithm. Lastly, in Section 7, we present our model validation and experimental results, demonstrating the viability of these load balancing approaches.

2 RELATED WORK

In adaptive mesh refinement (AMR) codes and hp-adaptive finite elements, dynamic load balancing typically relies on one of three approaches: (1) graph partitioning using space-filling curves, known as geometric partitioning [? ? ? ?]; (2) graph partitioning algorithms, such as those provided in the METIS and SCOTCH libraries [? ? ? ? ?]; or (3) diffusion- or refinement-based approaches [?]. The simulation of coastal inundation introduces irregularity, which decouples the memory and load balancing constraints. To the authors' knowledge, the only paper that uses dynamic load balancing to address this issue is [?]. However, their approach only balances load on structured grids, which may result in memory overflow. Local timestepping methods introduce similar irregularity. Seny et al. have proposed a static load balancing scheme using multi-constraint partitioning in [?]. However, they note the dynamic load balancing problem as an open one. Some examples of load balancing algorithm evaluations in the context of task-based execution models include the use of cellular automata [?], hierarchical partitioning [?], and gossip protocols [?].

There has been much previous work on the use of system-level hardware simulation to evaluate how existing applications will behave on future architectures (e.g. [? ? ? ? ?]). For example, the SST-macro simulator [?] allows performance evaluation of existing MPI programs on future hardware by coupling skeletonized application processes with a network interconnect model. Previous work has investigated the impact of various static task placement strategies for AMR multigrid solvers using simulation techniques [?]. Evaluation of various load balancing strategies using discrete event simulation has been conducted in [?], and the use of particular asynchronous load balancing algorithms has been discussed in [? ?]. However, [?] examines only a simple greedy algorithm that ignores communication costs, and [?] examines an offload model that still requires synchronization to enter and exit the load balancing phase. Another framework, StarPU with SimGrid [? ?], allows estimation of task-based execution on a parameterized hardware model, emphasizing the simulation of heterogeneous nodes with GPUs. However, these papers leave modeling of distributed memory simulations as a subject of future work. These existing simulators do not natively model one-sided, active message communication or asynchronously migratable objects, where the migration of objects happens simultaneously with the execution of computational tasks.

Our simulation approach combines an application task dependency graph with a performance model to enable the evaluation of an application (DGSWEM) that does not yet have a task-based implementation, allowing us to forecast its performance with different load balancing strategies and estimate the benefits on large-scale runs. Furthermore, the interplay between the multi-constraint nature of the storm surge application and the effectiveness of the load balancing strategies has not been previously investigated, to the best of our knowledge.

3 FORECASTING HURRICANE STORM SURGE

During hurricane events, the most dangerous and costliest portion of the hurricane is the surge pushed onto the coast by the wind. In order to accurately assess the impact of a hurricane in coastal regions, accurate storm surge modeling is required. We derive such a model by simulating the sea surface using the 2D depth-averaged shallow water equations. We additionally incorporate the effects of atmospheric pressure changes at the free surface, Coriolis effects, bottom friction, and a wind forcing source term.

These equations and algorithms are presented in [?] using a discontinuous Galerkin (DG) method. The solution is approximated on an unstructured mesh, which provides resolution on the order of 10 meters near coastal regions of interest. High-order DG methods are attractive due to their relatively local stencil, high arithmetic intensity, and stability, which allow for explicit timestepping. Higher-order DG methods for the shallow water equations are a topic of active research [? ? ? ?].

In order to understand the local nature of the kernel, we borrow the language from [?]:

    M \frac{\partial Q_H}{\partial t} = N(Q_H) + S(Q_H^+, Q_H^-),    (1)

where Q_H denotes the degrees of freedom, M denotes the update kernel, N corresponds to the area kernel, which can be computed entirely locally, and S corresponds to the surface kernel, which requires computation at the element interface. The system of differential equations is then temporally discretized using a Runge-Kutta (RK) method.

The above numerical model is parallelized by distributing elements across a set of concurrent processes (referred to as ranks) that cooperate by message passing. In (1), both the update kernel and the area kernel can be computed entirely locally (independent of other elements). The surface kernel is the only portion of the DG algorithm that may require non-local information, so elements not on the same rank must communicate over the network. With explicit timestepping, this results in a task dependency graph that exposes a large amount of task parallelism and asynchrony.

One of the key aspects of a storm surge code is its ability to simulate inundation. Due to numerical artifacts, regions of negative water column height may occur throughout the simulation, rendering the shallow water equations meaningless, both mathematically and physically. To remedy this, an additional slope limiter is applied after the update kernel. We use the limiter proposed in [?], which locally examines elements after each update and fixes problematic regions. One of the key features of this algorithm is its ability to classify elements as either wet or dry. The performance implication of this classification is that dry elements require almost no work; thus, as the hurricane inundates the coast, elements become wet in localized regions, causing load imbalance.

4 THE DGSIM SIMULATOR

Our DGSim simulator utilizes discrete event simulation to model the execution of the parallel application. It is expected to only run skeletonized applications, where heavy computation and large data allocations are omitted for efficiency. Every thread in the simulation schedules events into a global priority queue keyed on virtual time, so that events are processed in the correct simulation order. Threads "burn" virtual time to simulate the execution of heavy computational tasks, and message arrival times are delayed to capture communication costs. This method permits efficient large-scale simulation while retaining the salient computation and communication characteristics of the program execution. With DGSim, we are able to rapidly evaluate hundreds of simulation trials with different parameterizations in 4,000 core-hours, compared to over 21,000,000 core-hours had we run DGSWEM itself, roughly translating to a 5,200x speed-up.
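The virtual-time mechanism described above can be sketched as a small discrete-event core. This is an illustrative toy, not the DGSim implementation: the class and method names (`VirtualTimeSim`, `schedule`, `run`) and the delay values are our own.

```python
import heapq
import itertools

class VirtualTimeSim:
    """Minimal discrete-event core: a global priority queue keyed on virtual time."""
    def __init__(self):
        self.now = 0.0
        self._queue = []                   # entries: (virtual time, seq, callback)
        self._seq = itertools.count()      # tie-breaker for events at equal times

    def schedule(self, delay, callback):
        """Schedule `callback` to fire `delay` virtual seconds from now."""
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), callback))

    def run(self):
        """Pop events in virtual-time order until the queue drains."""
        while self._queue:
            self.now, _, callback = heapq.heappop(self._queue)
            callback()

# A worker "burns" virtual time for a compute kernel, then sends a message
# whose arrival is delayed to model communication cost.
sim = VirtualTimeSim()
log = []
def task_done():
    log.append(("task", sim.now))
    sim.schedule(5e-6, lambda: log.append(("msg", sim.now)))  # 5 us delivery delay
sim.schedule(2.62e-6, task_done)  # 2.62 us compute burn (one wet element, one RK stage)
sim.run()
```

Because heavy kernels are skipped and only timestamps advance, a run like this costs microseconds of wall time regardless of how much virtual compute it represents, which is the source of the simulator's speed-up over running the application directly.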

DGSim is designed from the ground up to model a very aggressive asynchronous execution framework, which comprises three layers. The lowest layer of our framework, the parallel machine, mirrors the communicating multi-threaded processes prevalent in current extreme-scale computing. All inter-process communication is expressed through one-sided, point-to-point active messages, each containing a bound function. Active messages impose no synchronization and thus encourage a very asynchronous methodology. Each process contains a master thread responsible for executing active messages, and all other threads belong to a pool of worker threads. Messages from other processes are only executed by the master thread, but any thread may send messages to other processes. The dedication of the master thread to handling incoming messages and managing worker threads may seem wasteful, but this fits asynchronous execution well, as it ensures there is a core that will remain attentive to servicing incoming messages. Production asynchronous runtimes such as Charm++ usually dedicate at least one CPU core for communication and scheduling.

The second layer of the framework's stack is our interpretation of the Active Global Address Space (AGAS) [?]. This is a software layer that allows the user to place parts of their application state in a named-object store without having to explicitly manage where objects reside. Users simply visit objects by sending an active message to a name (as opposed to a process id), and the function will be executed on the process currently holding the object. To migrate an object, there is a relocate function that takes a name and a target process id and will cause the named object to migrate from wherever it currently resides to the target. Relocates and visits can be issued concurrently from any number of processes without any synchronization whatsoever. This design is similar to the chare/actor model of Charm++, which also expresses parallel computation as asynchronous method invocations between relocatable objects.
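The visit/relocate semantics can be illustrated with a single-process toy directory; in the real layer, a visit is delivered as a one-sided active message (and forwarded if the object has moved), whereas here everything is a local lookup. All names below (`ToyAGAS`, `create`, the `"tile:17"` key) are ours, not the framework's API.

```python
# Toy model of the AGAS layer: a directory maps object names to owner ranks;
# `visit` runs a function where the object lives, `relocate` moves it.
class ToyAGAS:
    def __init__(self, nranks):
        self.nranks = nranks
        self.owner = {}   # object name -> rank currently holding the object
        self.store = {}   # object name -> object state

    def create(self, name, state, rank):
        self.owner[name] = rank
        self.store[name] = state

    def visit(self, name, fn):
        """Execute fn on the process holding `name` (an active message in reality)."""
        rank = self.owner[name]
        return fn(rank, self.store[name])

    def relocate(self, name, target_rank):
        """Migrate the named object; subsequent visits chase the new owner."""
        self.owner[name] = target_rank

agas = ToyAGAS(nranks=4)
agas.create("tile:17", {"elements": 512}, rank=0)
agas.relocate("tile:17", target_rank=3)        # issued without any synchronization
where, size = agas.visit("tile:17", lambda r, s: (r, s["elements"]))
```

The key property mirrored here is that callers address a *name*, never a rank, so relocations and visits can be issued concurrently without coordinating with the task scheduler.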

The top layer is a distributed tasking layer, which allows the application to describe named units of non-blocking work that are executed on worker threads. Tasks mainly communicate via satisfaction of other tasks' dependencies. The task graph has no explicit hierarchy (no parent or sub-tasks), and all dependencies are managed through task names. When a task has computed data required by other tasks, that task sends a satisfaction containing the data to its successors. All task satisfactions are sent through AGAS visits, allowing task migrations to be issued independently of the task graph scheduler. This separation allows the scheduling logic to reside entirely in the application code, giving it easy access to pertinent metadata, and enables task migration to occur concurrently with task execution. Thus, we have no need to explicitly enter or exit a load balancing phase.
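The dependency-satisfaction mechanism can be sketched as follows: each named task carries a count of unsatisfied dependencies and becomes runnable when the count reaches zero, at which point it forwards its output as satisfactions to its successors. The class and names are hypothetical, and the real layer routes satisfactions through AGAS visits rather than direct calls.

```python
class ToyTaskGraph:
    """Tasks keyed by name; a task runs when its remaining-dependency count hits zero."""
    def __init__(self):
        self.tasks = {}

    def add_task(self, name, deps, fn, successors=()):
        self.tasks[name] = {"remaining": len(deps), "inputs": {}, "fn": fn,
                            "succ": list(successors)}
        if not deps:                      # no dependencies: runnable immediately
            self._run(name)

    def satisfy(self, name, sender, data):
        """Deliver one satisfaction (with payload) to task `name`."""
        t = self.tasks[name]
        t["inputs"][sender] = data
        t["remaining"] -= 1
        if t["remaining"] == 0:
            self._run(name)

    def _run(self, name):
        t = self.tasks[name]
        out = t["fn"](t["inputs"])
        for s in t["succ"]:               # forward output to successor tasks
            self.satisfy(s, name, out)

ran = []
g = ToyTaskGraph()
g.add_task("stage2", deps=["stage1"], fn=lambda ins: ran.append("stage2"))
g.add_task("stage1", deps=[], fn=lambda ins: ran.append("stage1") or 1,
           successors=["stage2"])
```

Because the graph is flat and all routing goes through names, nothing in this structure needs to know where a task's tile currently resides, which is what lets migration proceed concurrently with execution.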

5 PERFORMANCE MODEL CALIBRATION

5.1 Compute Cost Model

Since DGSim utilizes a skeletonized form of the kernel, we developed a model to estimate the time to execute each task. As there is a limit on the amount of fine-grain parallelism we can expose due to scheduling overhead, we agglomerate elements together into tiles, whose size is sufficiently large that the performance improvement obtained via exposing more parallelism to the system amortizes the scheduling overhead. Furthermore, more tiles than worker threads are assigned to each rank. This oversubscription of compute resources allows the scheduler to hide message latencies.

Given a tile T that consists of the elements Ω_e, we approximate the execution time t_T to advance T by one RK stage by

    t_T = \sum_{\Omega_e \in T} \omega_e \tau_e,

where ω_e is 1 if the element is wet and 0 if it is dry, and τ_e is the time required to advance one element by one RK stage. Timing DGSWEM using polynomial order p = 2 with a modal filter and the wetting-and-drying limiter on a single Edison node, we measured τ_e to be 2.62 µs.
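Under this model, a tile's cost is simply its wet-element count times τ_e, so dry tiles are nearly free. A minimal sketch (the constant comes from the measurement above; the function name and example tile are ours):

```python
TAU_E = 2.62e-6  # seconds per wet element per RK stage, measured on one Edison node

def tile_cost(wet_flags):
    """t_T = sum over elements of omega_e * tau_e; dry elements (omega_e = 0) are free."""
    return sum(TAU_E for wet in wet_flags if wet)

# A 1000-element tile that is 30% wet costs the same as a fully wet 300-element tile,
# which is exactly the irregularity that drives the load imbalance.
cost = tile_cost([True] * 300 + [False] * 700)
```

This also shows why static partitions degrade as the storm makes landfall: the wet flags change over time, so t_T drifts while the tile-to-rank assignment stays fixed.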

To determine the wet/dry status of elements at each time step, we use a hurricane simulation of a synthetic storm from a FEMA flood insurance test suite throughout this paper. The test problem is a 4-day simulation with a timestep of 0.25 seconds using a 2-stage RK method on a 3.6 million element unstructured mesh. Using this benchmark, the wet/dry state of each element is recorded every 1200 time steps in DGSWEM, and the recorded states are then interpolated for intermediate time steps in DGSim.

Due to constraints on the length of simulation, DGSim only computes a tenth of the total timesteps. This conservatively approximates the available work to hide the load balancing costs. The change in wet/dry fraction is scaled in a consistent manner to ensure that the entire hurricane is simulated. The other contributors to compute time are the task scheduler overhead and the cost to run the load balancer. We use timers to measure the actual costs of these computations on the host machine.

5.2 Communication Cost Model

In order to model the delay incurred by messages sent between ranks in the machine, we defined a hierarchical communications model with varying costs associated with sending messages across different memory levels. On many current and future memory architecture designs, the memory on a compute node is split into separate partitions. In such a configuration, the cost of accessing data from the closest partition is lower than for distant partitions, thus creating non-uniform memory access (NUMA) domains.

Given a message source, destination, and size, our simulator's performance model estimates the delivery delay by summing per-message (latency/overhead) and per-byte (bandwidth) costs over the path connecting the ranks. For our experiments, we utilize performance model parameters simulating a machine similar to Edison [?]. Table 1 summarizes the parameters used in our model to estimate message delivery costs. We conservatively assume that communication links are utilized by all cores that share them (e.g., multiple cores sharing the DDR3 bus). Since DGSim provides a dedicated communication thread per socket, the QuickPath and Aries endpoints are shared by only one or two threads, and thus would see minimal contention.

    Communications Performance Model
    Link              Latency (per-message)   Bandwidth (per-core)
    1866 MHz DDR3     77 ns                   4.3 GiB/s
    Intel QuickPath   132 ns                  11.5 GiB/s
    Cray Aries        1.2-1.5 µs              0.2-9 GiB/s

Table 1: Using the relative locations of the source and destination ranks, the simulator can estimate each message's transmission cost.

All messages are subject to two memory copies over the DDR3 memory bus: one to copy the message from application memory to a communications buffer, and an additional copy at the destination from the communications buffer to the final address. For intra-NUMA messages, these two copies constitute the entire delivery cost. The measured STREAM bandwidth on Edison is 103 GiB/s [?]; thus the per-core bandwidth is 4.3 GiB/s.

For inter-NUMA messages, messages must additionally traverse the inter-socket communications bus. On Edison, the sockets are connected via the Intel QuickPath interconnect [?]. Since no worker threads may send or receive messages, the dedicated communications thread can use the full unidirectional bandwidth (11.5 GiB/s) for sending messages. The latency for intra-/inter-NUMA messages is measured using Intel® Memory Latency Checker v3.5 with a 200 MB buffer on NERSC's Edison.

For inter-node messages, we parameterize our model using data from performance benchmarks conducted using the Cray Aries interconnect [?]. Although the Aries interconnect has a dragonfly topology, for simplicity we assume a uniform cost of communication between ranks. Figures 7 and 8 in that paper specify how the message latency and bandwidth costs vary with message size. Since we simulate two communicating ranks per node (one per socket), we conservatively halve the reported bandwidth. Section 7.1 presents a validation of our performance model, comparing our modeled execution time versus the empirically observed execution time of a skeletonized DGSWEM implementation.
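The delay estimate above can be sketched as latency plus size-over-bandwidth summed across each link a message traverses. The link values follow Table 1, except that we pin a single representative point for Aries (the model actually varies it with message size), and the path lists are a simplification of the simulator's source/destination placement logic.

```python
GIB = 2**30  # bytes per GiB

# Per-link (latency in seconds, bandwidth in bytes/second); values follow Table 1.
LINKS = {
    "ddr3":      (77e-9,  4.3 * GIB),   # per-core share of the DDR3 bus
    "quickpath": (132e-9, 11.5 * GIB),  # full unidirectional inter-socket bandwidth
    "aries":     (1.5e-6, 4.5 * GIB),   # representative point; varies with size
}

def delivery_delay(msg_bytes, path):
    """Sum per-message (latency) and per-byte (bandwidth) costs over every link."""
    return sum(lat + msg_bytes / bw for lat, bw in (LINKS[l] for l in path))

# Intra-NUMA: two copies over the DDR3 bus (into and out of the comms buffer).
intra = delivery_delay(64 * 1024, ["ddr3", "ddr3"])
# Inter-NUMA messages additionally cross the QuickPath interconnect.
inter = delivery_delay(64 * 1024, ["ddr3", "quickpath", "ddr3"])
```

Small messages are dominated by the latency terms, while large tile migrations are dominated by the bandwidth terms, which is why the model separates the two.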

6 BALANCERS

6.1 Theoretical Preliminaries

In order to express our load balancing algorithms, we first present the precise load balancing problem and then introduce our model terminology and the trellis load balancing concept. As mentioned in Section 5, elements are agglomerated into tiles to amortize the scheduling overhead. The simulation consists of many tasks, each one responsible for advancing a single tile by one RK stage. Based on the DG kernel (1), any edge between two elements on different tiles constitutes a dependency. In order to minimize the number of dependencies and expose the largest amount of asynchrony, we group elements into tiles using the METIS k-way graph partitioning algorithm [?].

Each tile i has a memory space requirement m_i, and the amount of work required to advance the tile by one RK stage is given by t_i. We now introduce the assignment variables χ_{ik}, where

    \chi_{ik} = \begin{cases} 1 & \text{if tile } i \text{ is on rank } k \\ 0 & \text{otherwise,} \end{cases}

and χ_{ik} is subject to the following constraints:

    \sum_{k=1}^{n_\text{ranks}} \chi_{ik} = 1 \quad \forall i = 1, \ldots, n_\text{tiles}.    (2)

Ideally, we would solve for time-dependent values of χ_{ik} that minimize the total application execution time. However, for asynchronous applications, this is an NP-hard mixed-integer optimal scheduling problem, which is not feasible to solve in situ. Since the compute cost of the tiles dominates the execution time, we make the simplifying assumption that the execution time is approximately proportional to

    T = \max_k \sum_{i=1}^{n_\text{tiles}} t_i \chi_{ik}.    (3)

Additionally, we also have a memory constraint: if m_i is the memory required for tile i and M_k is the available memory on rank k, then we obtain the additional constraints

    \sum_{i=1}^{n_\text{tiles}} m_i \chi_{ik} \le M_k \quad \forall k = 1, \ldots, n_\text{ranks}.    (4)

Our optimization problem is defined by (2), (3), and (4). Note that since the tiles' compute costs change as the simulation progresses, the optimal assignment χ_{ik} is also a function of the simulation's progress. Furthermore, the irregularity associated with the wetting and drying algorithm requires that the memory constraint (4) and the execution time (3) be accounted for separately. Before we discuss approaches to approximate optimal assignments χ_{ik}, we introduce some formalism upon which we will base our load balancing strategies.
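While finding the optimal assignment is NP-hard, evaluating a candidate assignment against the objective (3) and the memory constraint (4) is cheap. A sketch (the tile costs, memories, and capacities below are hypothetical illustration values):

```python
def evaluate(assignment, t, m, M):
    """Return (makespan T per (3), memory-feasible per (4)) for an assignment.

    assignment[i] = rank k holding tile i (i.e. chi_ik = 1); t and m are
    per-tile compute cost and memory; M[k] is rank k's memory capacity."""
    nranks = len(M)
    load = [0.0] * nranks   # sum_i t_i * chi_ik per rank
    mem = [0.0] * nranks    # sum_i m_i * chi_ik per rank
    for i, k in enumerate(assignment):
        load[k] += t[i]
        mem[k] += m[i]
    feasible = all(mem[k] <= M[k] for k in range(nranks))
    return max(load), feasible

# Four tiles, two ranks: wet tiles (t = 4) cost far more than dry ones (t = 1),
# but every tile occupies the same memory, so load and memory must be balanced
# as separate constraints.
t = [4.0, 4.0, 1.0, 1.0]
m = [1.0, 1.0, 1.0, 1.0]
T, ok = evaluate([0, 1, 0, 1], t, m, M=[2.0, 2.0])   # one wet + one dry tile per rank
```

Note that placing three tiles on one rank would lower neither T nor the imbalance here but would violate (4), illustrating why a single-constraint balancer is insufficient for this application.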

6.1.1 The trellis approach. The main idea of this approach is to execute a second, parallel task dependency graph to handle load balancing that is completely independent of the application task graph. The motivation for this approach is twofold: it decouples the load balancer from the application execution, avoiding the need for costly control synchronization, and it accurately weighs the improvement in load balance against the cost of making load balancing decisions.


First, a common strategy taken by semi-static load balancers is to periodically pause application execution, issue data and task migrations (rebalancing phase), and resume application execution. This strategy is convenient because it decouples the times during which tasks and data are allowed to be in transit from the times that tasks are executing and sending messages to each other. Unfortunately, in order to enter and exit the rebalancing phase, control dependencies are added to the application task graph (often in the form of global synchronization), which can severely impact the performance of a highly asynchronous execution model. The trellis approach introduces an ancillary set of tasks that collect the application state, make rebalancing decisions, and launch tile migrations, all without adding synchronization to the application execution.

Another advantage of the trellis approach is the fixed frequency and natural amortization of the cost of rebalancing. Were the frequency of rebalancing instead linked to application progress, the frequency of rebalancing would decrease as the load balance worsens, since each iteration would take longer to complete. Additionally, assuming reasonably strong scaling, doubling the simulation core count would approximately double the frequency of the load balancer even though the cost of the load balancer would remain the same. In practice, there is a trade-off between the improvement in run-time due to improved load balance and the cost of rebalancing. The trellis approach allows us to simplify our cost model, reducing the number of parameters that need to be tuned in order to maintain effective load balancing.

6.1.2 Local models. The second piece of formalism we would like to introduce is local models. These models capture local states of the system that can be used to make load balancing decisions. The use of local models reduces the need to aggregate global information; however, it is infeasible to keep all model information up-to-date at all times, so the information in these models can become stale. If a rank steals a tile based on stale information, the tile migration could in fact negatively impact the load balance. Since this information is used exclusively for load balancing, the correctness of the numerical simulation remains unaffected. For our problem, we consider three types of models:

(1) Tile model: Information such as the location and compute load of neighboring tiles.

(2) Rank model: Information such as how much work islocated on a rank (may aggregate local tile models).

(3) World model: Information about the global state ofthe simulation (may aggregate tile and rank models).

The nested nature of these models is represented in Figure 1. These models provide representations of the simulation upon which the load balancers make relocation decisions.

Fig. 1: Each world model is able to aggregate rank models, which aggregate tile models. This allows for representations of the system with various levels of completeness.

Listing 1: Asynchronous Diffusion Tile Model

    struct TileModel {
        int my_rank;
        map<int,int> neigh_rank;
        double wet_fraction;
        double get_gain(int rank);
        void local_broadcast();
    };

6.2 Static load balancing

By re-imagining the wetting and drying algorithm as a type of multirate timestepping, where the dry elements have an infinitely large timestep and the wet elements advance at the implemented timestep, we use the static load balancing approach outlined in [? ]. Here the goal is to assign an equal number of dry elements and wet elements to each rank. This is accomplished by using the multi-constraint graph partitioning outlined in [? ]. Note that balancing the memory and load constraints and balancing the wet and dry elements are equivalent formulations of the same load balancing problem.

6.3 Dynamic load balancing

Dynamic load balancers in our context are defined as load balancing strategies that operate in an entirely distributed, asynchronous manner. The methods relocate tiles based on a "load pressure"—the local difference in loads. The idea is to issue many individual relocation requests based on these local observations, and thereby achieve a global balance.

To accommodate both memory and load balance objectives, we implement the refinement procedure used in [? ]. Specifically, our asynchronous diffusion approach includes no world model, but we include the tile and rank models outlined in Listings 1 and 2.

The tile model consists of the tile's wet fraction, the location of neighboring tiles, and a communication gain function, which determines the net impact on inter-rank communication were the tile to be relocated. The rank model aggregates its tile models to include the tiles bordering the rank, changes in the rank's load balance, and a list of neighboring ranks. These neighboring tiles are stored in the StealQ. A rank


Listing 2: Asynchronous Diffusion Rank Model

    struct RankModel {
        map<int,pair<double,double>> rank_weight;
        StealQ boundary_data;
        void local_broadcast();
        void update_boundary(int leaving);
        void steal_one_tile();
    };

periodically broadcasts its load balancing information to its neighbors, allowing ranks to aggregate the load state of neighboring ranks. The execution model does not guarantee message ordering. In order to ensure correctness, both tile and rank models call a local_broadcast function, which updates neighbors' rank and tile states at regular intervals. The StealQ contains two STL maps for each rank. This data structure allows us to prioritize which tile to steal to balance an imbalance in wet or dry tiles from a given rank. The maps are sorted by the amount of extra communication that would be required after stealing a given tile; we prioritize stealing tiles that minimize this additional communication.
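The "two sorted STL maps per rank" idea can be sketched as follows. This is our illustrative reconstruction, not the paper's actual StealQ implementation: field names and the demo helper are assumptions. std::map keeps keys ordered, so the cheapest-to-steal candidate is always at begin():

```cpp
#include <map>
#include <cassert>

// Sketch of a StealQ: for each neighboring rank, one ordered map of wet
// candidate tiles and one of dry candidates, keyed by the extra
// inter-rank communication a steal would incur.
struct StealQ {
    // rank id -> (extra communication cost -> tile id)
    std::map<int, std::map<double, int>> wet, dry;

    // Best wet tile to steal from `rank`, or -1 if none are available.
    int best_wet(int rank) const {
        auto it = wet.find(rank);
        if (it == wet.end() || it->second.empty()) return -1;
        return it->second.begin()->second;    // lowest-cost candidate
    }
};

// Tiny self-check: insert two wet candidates for rank 3 and return the
// tile the queue would choose (the one adding the least communication).
int demo_best_wet() {
    StealQ q;
    q.wet[3][0.5] = 42;   // costlier candidate
    q.wet[3][0.1] = 7;    // cheaper candidate -> preferred
    return q.best_wet(3);
}
```

A symmetric best_dry lookup would serve the dry-element constraint.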

During the execution of a task, the tile checks to determine if the wet fraction has switched between wet and dry states. In the case this happens, the tile's local broadcast updates neighboring tile models. An AGAS visit is utilized to ensure that updates are delivered to the tiles no matter where they are located. The rank model periodically calls steal_one_tile(). Based on the rank model's estimation of the neighboring ranks' work and memory loads, the rank assembles a priority queue PQ sorted by normalized wet- and dry-element counts. Only ranks that are overworked or have too many tiles will be inserted into this queue; overburdened ranks will never steal tiles. Karypis and Kumar note this stealing heuristic may lead to chatter, in which tiles get passed back and forth between the same ranks. To mitigate this, we introduce admissible pressure gradients, i.e., a required minimum imbalance to warrant being inserted into the pool of candidate ranks to be stolen from. Specifically, let the granularity G be the largest number of elements per tile. For a given rank r, let D(r) be the number of dry elements on rank r and W(r) the number of wet elements on rank r. The precise heuristic for inserting elements is shown in Algorithm 1.

Then, we attempt to find the tile on the most imbalanced neighboring rank which borders the stealing rank and would result in the best communication gain. If there is no satisfactory tile to select, we attempt to improve the most imbalanced constraint associated with the next most-imbalanced rank.

6.4 Semi-static load balancing

While the dynamic load balancing procedure allows for fully distributed load balancing decisions, it is unclear that this will result in a good global load balance. The multi-constraint

Algorithm 1 Asynchronous Diffusion Thresholding

    Set r to current rank ID
    Set ΣW ← W(r) + ∑_{n∈neighbors(r)} W(n)
    Set ΣD ← D(r) + ∑_{n∈neighbors(r)} D(n)
    for all n ∈ neighbors(r) do
        if W(r) + αG < W(n) then
            Insert n into PQ(r) with weight W(n)/ΣW
        end if
        if D(r) + βG < D(n) then
            Insert n into PQ(r) with weight D(n)/ΣD
        end if
    end for
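Algorithm 1 can be rendered as a short routine. The sketch below is our reconstruction under stated assumptions (the interface, Candidate struct, and parameter passing are ours): a neighbor enters the candidate pool only if its wet (or dry) count exceeds the local count by more than the admissible pressure gradient αG (resp. βG), weighted by its share of the neighborhood total:

```cpp
#include <vector>
#include <queue>
#include <cassert>

struct Candidate { double weight; int rank; };
inline bool operator<(const Candidate& a, const Candidate& b) {
    return a.weight < b.weight;      // priority_queue pops largest weight
}

// W_r/D_r: local wet/dry counts; W_n/D_n: neighbor counts, parallel to
// `ids`; alpha/beta: admissible pressure gradients; G: granularity.
std::priority_queue<Candidate> threshold(
    int W_r, int D_r,
    const std::vector<int>& W_n, const std::vector<int>& D_n,
    const std::vector<int>& ids,
    double alpha, double beta, int G) {
    double sumW = W_r, sumD = D_r;
    for (std::size_t j = 0; j < ids.size(); ++j) {
        sumW += W_n[j];
        sumD += D_n[j];
    }
    std::priority_queue<Candidate> pq;
    for (std::size_t j = 0; j < ids.size(); ++j) {
        // Insert only if the imbalance exceeds the admissible gradient.
        if (W_r + alpha * G < W_n[j]) pq.push({W_n[j] / sumW, ids[j]});
        if (D_r + beta  * G < D_n[j]) pq.push({D_n[j] / sumD, ids[j]});
    }
    return pq;
}
```

Popping the queue then yields the most overloaded admissible neighbor first, matching the stealing order described above.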

partitioning problem may result in unconnected partitions [? ]. Additionally, the thresholding heuristic is unable to correct gradual yet large load imbalances.

To ensure that we have a good load balance, we consider a semi-static approach which periodically rebalances the global model. In our formalism, the tile model updates the world model with its wet fraction at a fixed frequency. Therefore, we can trigger a load balance when all the tiles have sent their information without having to perform any synchronization. The global model incorporates all the information required to perform load balancing, i.e., the current tile partition and an approximation to the current wet fractions of the respective tiles. The global model is the only entity which may issue relocates; as such, the tile partition stored in the global model is always accurate.

Semi-static rebalancing involves constructing an updated tile-to-rank assignment based on the constraints associated with the updated wet fractions using the multi-constraint graph partitioning algorithm [? ]. To keep the master threads available to process messages, we offload the relatively expensive rebalancing operations onto the worker threads where the global model is stored. After obtaining this new partition, we potentially have the means to load balance the system. However, the multi-constraint graph partitioner is unable to take into account the previous location of the tiles. As a worst-case scenario, the graph partitioner could return the same partition with permuted partition IDs. The resulting relocation would require migration of every tile, whereas simply maintaining the old partition would achieve the same load balance without disruption. To remedy this, we solve a minimization problem which determines a permutation of the global rank IDs that minimizes the number of tiles to be migrated. Using a greedy method, we construct a priority queue of old-ID and new-ID pairs weighted by the number of tiles that reside in both partitions. We then pop the members of the priority queue until all of the ranks have been assigned new IDs. This approach can have a significant impact on the number of tiles migrated: for


DGSim Simulation Parameters

    Machine                            DGSWEM
    NUMA domains/node        2         Runge-Kutta stages      273600
    Threads/NUMA domain      12        Polynomial order (p)    2
    CPU clock speed          2.4 GHz   Tiles/worker thread     4

    Asynchronous Diffusion             Semi-Static
    Rebalance frequency      40 ms     Rebalance frequency     5 s
    α                        1
    β                        2

Table 2: DGSim parameters for load balance comparison.

example, during a 1200 core run with 4,400 tiles, the greedyassignment reduced the total number of tiles migrated fromroughly 317,000 to 106,000.
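The greedy relabeling can be sketched as follows. This is our illustrative version (names and the overlap-matrix encoding are assumptions): overlap[a][b] counts tiles resident in both old partition a and new partition b; popping (count, old, new) triples from a max-heap and fixing the first unassigned pairing keeps heavily overlapping partitions in place, so fewer migrations are issued:

```cpp
#include <vector>
#include <queue>
#include <tuple>
#include <cassert>

// Return newid[a] = new label assigned to old partition id a, chosen
// greedily to maximize tiles that stay put (i.e., minimize migrations).
std::vector<int> greedy_relabel(const std::vector<std::vector<int>>& overlap) {
    int n = static_cast<int>(overlap.size());
    // Max-heap of (overlap count, old id, new id), largest overlap first.
    std::priority_queue<std::tuple<int, int, int>> pq;
    for (int a = 0; a < n; ++a)
        for (int b = 0; b < n; ++b)
            pq.push(std::make_tuple(overlap[a][b], a, b));
    std::vector<int> newid(n, -1);
    std::vector<bool> taken(n, false);
    int assigned = 0;
    while (assigned < n) {
        auto [cnt, a, b] = pq.top();
        pq.pop();
        (void)cnt;
        if (newid[a] != -1 || taken[b]) continue;  // already matched
        newid[a] = b;
        taken[b] = true;
        ++assigned;
    }
    return newid;
}
```

For instance, if old partition 0 shares most tiles with new partition 1 (and vice versa), the relabeling swaps the IDs rather than forcing both partitions to migrate.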

7 NUMERICAL EXPERIMENTS

7.1 Empirical Validation of DGSim

To ensure that our simulator is accurate, we compared the DGSim execution times for the Storm36 simulation to a skeletonized version of DGSWEM run on NERSC's Edison. The skeletonized DGSWEM implements the programming model outlined in Section 4. Inter-process communication is achieved through one-sided asynchronous remote procedure calls using UPC++ [? ], and tasks are executed using a dataflow execution model with a master-worker thread organization. One feature not implemented in the skeletonized DGSWEM is the AGAS layer. However, for the statically balanced problems considered for this validation, AGAS overhead is negligible due to caching of neighbors' ranks. By burning identical worker thread execution times, the validation examines whether the messaging and threading overhead models as described in Section 5.1 are reasonable. The parallel efficiencies for the two simulations are shown in Figure 2. Differences in execution times between the simulation and the skeletonized DGSWEM application do not exceed 7% of the runtime. Furthermore, the DGSim execution time was slower than the DGSWEM execution time for all simulations, reflecting the conservative design of our cost models. These validation results demonstrate that despite our relatively simple network model, DGSim predicts execution times for this particular hurricane simulation with accuracies comparable to more sophisticated approaches, e.g. [? ? ].

7.2 Load Balance Comparison

In order to compare the load balancing algorithms outlined in Section 6, we compare performance for the hurricane used in the previous subsection using a fixed machine and algorithmic configuration outlined in Table 2. The load balancer parameters are based on parameter sweeps done at 1200 and 3600 cores.

To quantify the quality of our load balancing algorithms,we use two performance metrics: the compute intensity, which

Fig. 2: Comparison between the parallel efficiency of the simulated performance (DGSim) and the skeletonized application code (DGSWEM) on NERSC's Edison supercomputer for 1200 to 6000 cores for the Storm36 hurricane, using Machine and DGSWEM parameters as outlined in Table 2. Each configuration was run five times. The standard deviations for a given core count were each below 1%.

is defined for a given rank as the fraction of time spent computing at a given instant in the simulation, and the imbalance, which is defined as I = (max_k T_k − T̄)/T̄, where max_k T_k is the maximum load on a given rank and T̄ is the average load across the system. The combined performance results are shown in Figure 3 and the elapsed times and speed-ups in Table 3.
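The imbalance metric is straightforward to compute from per-rank loads; a minimal sketch (function name is ours):

```cpp
#include <vector>
#include <algorithm>
#include <numeric>
#include <cassert>

// Imbalance I = (max_k T_k - Tbar) / Tbar over per-rank loads.
// I = 0 means perfectly balanced; I = 0.5 means the busiest rank
// carries 50% more load than the average.
double imbalance(const std::vector<double>& load) {
    double maxT = *std::max_element(load.begin(), load.end());
    double avgT = std::accumulate(load.begin(), load.end(), 0.0)
                  / load.size();
    return (maxT - avgT) / avgT;
}
```

For example, loads of {3, 1} give a maximum of 3 against an average of 2, hence I = 0.5.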

Since the traditional static load balancing approach solely distributes tiles to ranks to satisfy the memory constraint, it is expected that the static curve in Figure 3 has a high imbalance. This is reflected in Figure 4a, where roughly half the ranks are underutilized until the storm inundates the coast. This load profile is consistent with the 60% parallel efficiency observed in Figure 2. The multi-constraint static load balancer takes load as well as memory into account. This results in much better load balance until the hurricane makes landfall. Once this happens, the dynamic nature of the computational load causes a strong increase in imbalance and a corresponding decrease in compute intensity. Since the majority of the time DGSWEM simulates is prior to landfall, the multi-constraint static mesh partitioning achieves a speed-up of 1.28 versus the original static partitioning.

Both the asynchronous diffusion and semi-static load balancers are very effective at reducing load imbalance. The compute intensities shown in Figures 4d and 4c illustrate that the ranks are well utilized throughout the simulation. With roughly equal execution times, both load balancers achieve 96% of the ideal parallel efficiency. While the semi-static strategy migrates significantly more tiles than asynchronous diffusion, we have observed only a weak dependence of execution time in DGSim on the number of tiles moved. AGAS's ability to overlap computation and tile migration minimizes resource starvation. Furthermore, the execution time is mainly determined by the simulation's critical path; thus, migrating tiles not on the critical path will not impact execution time.


Fig. 3: Load imbalance for Storm36 using 1200 cores with the configuration outlined in Table 2. The following load balancing strategies were evaluated: static (ST), multi-constraint static (MCS), asynchronous diffusion (AD), and semi-static (SS).

            Static   Multi-constraint     Asynchronous diffusion                 Semi-static
                     static
    Cores   T_ST     T_MC    S_ST         T_AD   T_M     σ_TM   S_ST  S_MC      T_SS   T_M      σ_TM   S_ST  S_MC
    1200    2,613    2,048   1.28         1,686  3,654   145    1.55  1.21      1,675  723,109  2,325  1.56  1.22
    2400    1,310    1,058   1.24         849    9,532   189    1.54  1.25      856    681,654  557    1.53  1.24
    3600    876      698     1.26         571    16,614  319    1.53  1.22      585    641,950  568    1.50  1.19
    4800    659      553     1.19         431    24,402  319    1.53  1.28      452    593,709  3,932  1.46  1.22
    6000    528      444     1.19         348    36,955  761    1.52  1.28      373    574,109  1,255  1.41  1.19

Table 3: Performance of load balancers for Storm36. All times are reported in seconds. For each case, five runs were performed. The standard deviation of all execution times was found to be below 2% of the mean. T corresponds to the execution time in seconds. T_M refers to the number of relocated tiles. The speed-up S_X is the speed-up of the simulation relative to strategy X at the given core count. Due to non-determinism in our simulation, we report the standard deviation of tiles moved, σ_TM, for a sample size of 5.

(a) Static  (b) Multi-constraint static  (c) Asynchronous diffusion  (d) Semi-static

Fig. 4: Compute intensities of the various load balancing strategies for a 1200-core simulation. For clarity, the ranks have been sorted according to average compute intensity.

7.3 Strong Scaling Study

To demonstrate that these algorithms are scalable for operationally relevant core counts, we present a strong scaling study for up to 6000 cores. To obtain an understanding of the quality of the load balancers relative to achievable performance, we define the parallel efficiency E for load balancer LB as E_LB = T*/(n_c · T_LB), where T* is the serial execution time, n_c is the core count, and T_LB is the execution time at a given core count. As the master threads do not compute tasks themselves, the best attainable parallel efficiency is 11/12 ≈ 91.7%.
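The efficiency formula reduces to a one-liner; the helper below (our naming) makes the definition concrete:

```cpp
#include <cassert>

// Parallel efficiency E_LB = T* / (n_c * T_LB): serial time divided by
// total core-time spent at a given core count. With one master thread
// per 12-thread NUMA domain doing no compute, the ceiling is 11/12.
double parallel_efficiency(double T_serial, int n_cores, double T_parallel) {
    return T_serial / (n_cores * T_parallel);
}
```

For example, a run that takes 12 s on 100 cores for a problem with a 1200 s serial time is running at perfect efficiency.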

Using the parameters in Table 2, the parallel efficiencies for the four partitioning strategies are shown in Figure 5. Note that by fixing the number of tiles per worker thread, the task granularity decreases as we scale out to higher core counts. First, we note that the code scales well: the static partitioning strategy loses less than 1% parallel efficiency across the range of core counts. The execution time here should be similar to that of a fully wetted mesh, demonstrating the parallelizability of the DG method. Next, the multi-constraint static partitioner experiences a slight degradation in performance: E^MC_6000 / E^MC_1200 = 92.1%. As the number of cores increases, each


Fig. 5: Parallel efficiency of the various load balancing strategies on Edison for 1200 to 6000 cores.

NUMA domain is assigned a smaller fraction of the mesh. Scaling out, the imbalance becomes more sensitive to local inundation, causing degradation of the strong scaling performance.

The semi-static load balancer also loses efficiency at scale, with E^SS_6000 / E^SS_1200 = 89.6%. Since the rebalance frequency is fixed by the trellis approach, as we scale out to larger core counts, the number of times the balancer is called decreases. Furthermore, the cost of repartitioning the tile graph increases: at 6000 cores, the METIS calls take approximately 3.7 seconds. We suspect that the performance degradation is due both to the faster rate at which imbalance is introduced and to a mismatch between the load balance used to compute the repartitioning and the load balance at the time the AGAS migrations are issued. Ultimately, the performance appears to remain robust, with the semi-static partitioner maintaining a speed-up of 1.19 over the multi-constraint static partitioner at 6000 cores.

The asynchronous diffusion load balancer scales the best, with E^AD_6000 / E^AD_1200 = 96.9%. Since the cost of rebalancing is entirely local, the computational complexity of load balancing decisions does not grow with the number of cores. Furthermore, since the rebalancing frequency is roughly two orders of magnitude higher than the semi-static rebalancing frequency, the asynchronous diffusion approach does not struggle with increased irregularity due to higher core counts. A common problem with strong scaling of diffusion-based algorithms is the persistence of small gradients: even with finer tiles at large core counts, as the rank graph grows, the maximum allowable imbalance grows as well. This did not appear to be a problem here. However, we do not claim that this approach would work equally well for other applications. Ultimately, asynchronous diffusion worked very well, providing a speed-up of 1.28 over the multi-constraint static partitioning strategy at 6000 cores and achieving 93.7% of the maximum attainable speed-up.

8 CONCLUSION AND FUTURE WORK

Hurricane storm surge simulation stands to benefit greatly from the dynamic load balancing techniques outlined in this paper. The irregularity introduced by flooding requires load balancers to simultaneously account for both memory and compute load. We have presented a dynamic diffusion-based and a semi-static load balancing approach for an asynchronous, task-based implementation of DGSWEM. To enable rapid prototyping and evaluation of these load balancing strategies, we simulated DGSWEM using a discrete-event simulation approach, which allowed us to evaluate configurations 5,000 times faster than running the actual code (in terms of core-hours). We found that static multi-constraint partitioning gives a speed-up of 1.28 over static single-constraint balancing. Both the semi-static and asynchronous diffusion approaches worked very well, achieving speed-ups of 1.2 over the static multi-constraint partitioning strategy at 1200 cores. The asynchronous diffusion approach scaled best, maintaining a speed-up of 1.52 over the original static partitioning approach at 6000 cores. Furthermore, the asynchronous diffusion approach migrated an order of magnitude fewer tiles than the semi-static approach.

The irregularity of the hurricane limits the need for more sophisticated approaches. By only simulating every tenth timestep, we have exaggerated the rate of change of the imbalance by a factor of ten. In practice, we expect the methods proposed here to completely balance the compute load. Implementing these load balancing strategies in DGSWEM is a topic of future work.

The nature of simulating DGSWEM opens up many interesting avenues of future research. We plan to model the parallel performance behavior of DGSWEM on future supercomputer architectures, e.g. x86 many-core, GPU, and RISC cores. Additionally, we would like to perform a co-design exploration of algorithmic and software optimizations in conjunction with the hardware parameter space. Increased communication costs on future architectures will likely play a role in choosing between load balancing strategies that trade between load balance quality, application communication costs, and tile migration costs.

ACKNOWLEDGMENTS

Maximilian Bremer was funded by the U.S. Department of Energy's Computational Science Graduate Fellowship, Grant DE-FG02-97ER25308. This manuscript has been authored by authors at Lawrence Berkeley National Laboratory under Contract No. DE-AC02-05CH11231 with the U.S. Department of Energy. Furthermore, this research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231.


A ARTIFACT DESCRIPTION: SEMI-STATIC AND DYNAMIC LOAD BALANCING FOR ASYNCHRONOUS HURRICANE STORM SURGE SIMULATIONS

A.1 Abstract

This artifact contains descriptions of the discrete-event simulator used to produce all of the results shown in the paper. Also described are the input data sets as well as the post-processing steps.

A.2 Description

A.2.1 Check-list (artifact meta information).

• Algorithm: Discrete event simulation
• Program: C++, MATLAB
• Compilation: GNU C++11 compiler
• Data set: Formatted text files
• Run-time environment: Shared partition on NERSC's Edison
• Execution: Serial
• Output: Formatted text files
• Experiment workflow: Modify parameters; compile; submit
• Publicly available?: No

A.2.2 How software can be obtained (if available). Software is not yet open-source.

A.2.3 Hardware dependencies.

A.2.4 Software dependencies. In order to compile the code, we require a GNU C++ compiler that supports the C++11 standard. Versions that are known to work are:

• g++ (Ubuntu 4.8.4-2ubuntu1 14.04.3) 4.8.4• g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)• g++ (GCC) 6.1.0 20160427 (Cray Inc.)

For post-processing, we require MATLAB. In this paper, ver-sion 9.0.0.341360 (R2016a) was used.

A.2.5 Datasets. There are two main input data files: an ADCIRC-formatted finite element mesh, and a custom-generated file that stores the simulation's wet/dry status.

A.3 Installation

N/A

A.4 Experiment workflow

The machine and problem parameters are a combination of hard-coded constants and environment variables. Compile-time parameters must be set before compilation. Additionally, compiler directives are used to select load balancing strategies. For example, to run the semi-static strategy, we would compile with

    flags=-DSEMISTATIC make exe/dgswem.o3

A detailed overview of compiler directives and environmentvariables can be found in the repository’s README.md.

A.5 Evaluation and expected result

Basic metadata such as the simulation time and the number of tiles moved is written to standard output. Compute profiling data, such as how much time was spent computing, sending, and idling, is written to the timer/ directory. This data is processed via MATLAB by executing the st36_proc.m script located in scripts/visualization.

Load imbalance is stored in the diagnostics folder diag/. This data is processed via the MATLAB script diag_proc.m.

Due to the self-timing procedure for determining scheduling overhead, the simulation results are non-deterministic. Therefore, it is important to run each simulation several times and report the measurement variation.
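The variation reporting can be done with standard sample statistics; a minimal sketch (helper names are ours, and the paper's own post-processing is done in MATLAB rather than C++):

```cpp
#include <vector>
#include <cmath>
#include <cassert>

// Mean of repeated simulation timings.
double mean(const std::vector<double>& x) {
    double s = 0.0;
    for (double v : x) s += v;
    return s / x.size();
}

// Sample standard deviation (Bessel's correction, n - 1 denominator),
// appropriate for the small sample sizes (5 runs) used in the paper.
double sample_stddev(const std::vector<double>& x) {
    double m = mean(x), s = 0.0;
    for (double v : x) s += (v - m) * (v - m);
    return std::sqrt(s / (x.size() - 1));
}
```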