
SHF: Medium: Formal Reliability Enhancement Methods for

Million Core Computational Frameworks

Ganesh Gopalakrishnan, Martin Berzins∗

School of Computing (∗ also Scientific Computing and Imaging Institute)

50 S. Central Campus Dr, Rm 3190
University of Utah
Salt Lake City, UT 84112

Email: {ganesh,mb}@cs.utah.edu

Web: http://www.sci.utah.edu

August 7, 2012

1 PI and Technical point-of-contact. Tel.: (801) 581-3568, Fax: (801) 585-5843.


PROJECT SUMMARY

A SHF: Medium: Formal Reliability Enhancement Methods for Million Core Computational Frameworks

The goal of this project is to contribute a practically applicable body of formal methods to address the challenge of writing better software for high performance computing problem solving environments. The intention is to use formal approaches to address the ever-increasing complexity, and hence potential fragility, of large simulation frameworks that solve complex multi-physics problems on hundreds of thousands of cores today and that, within the next decade, will use many millions of cores on exascale computers.

Although the primary aim is to develop, evaluate, and enhance a suitable body of formal methods in the context of Uintah, an advanced, parallel, adaptive multi-physics framework, the resulting approaches and methodology will apply to many other programs intending to use such large machines. Uintah is component-based, built upon the DOE Common Component Architecture (CCA), is applicable to a wide range of engineering domains, and has demonstrated scalability up to 196K cores. This project is perhaps one of the first serious attempts to apply a formal approach through a collaboration between the Uintah computational scientists and formal methods researchers with considerable experience applying formal methods to other supercomputing architectures and software. This collaboration will occur at the most opportune time (the present), when Uintah is being re-architected to also run on next-generation heterogeneous accelerator-based architectures such as DOE's Titan and the MIC chip-based Stampede architecture recently announced by NSF.

The results of this research will have two primary impacts: (i) providing an accurate calibration of formal approaches proposed in the past with respect to their applicability at scale and in the context of a realistic setting, and (ii) developing formal methods that exploit software architecture to localize debugging or even avoid creating bugs. This is in response to late-cycle bug-hunting being untenable at scale.

It is also our goal to considerably enhance reliability by formalizing and verifying component interactions at critical internal interfaces. This will directly result in a much more robust Uintah implementation.

Uintah is an asynchronous, adaptive, task-based software framework and may be viewed as an instance of the DARPA Silver model proposed for exascale. Examples of other task-based approaches are Intel's Concurrent Collections (CnC), the Habanero framework at Rice, the Charm++ framework at Illinois, and the Plasma framework at Tennessee. The proposed work will describe Uintah's internal operations through a formal operational semantics, thereby allowing it to be rigorously compared against existing approaches. The semantics will serve as a reference against which to compare the behavior of deployed Uintah implementations using run-time verification methods. Critical Uintah components such as its data warehouse will be formally verified using a combination of dynamic and static verification methods, and new, improved versions will be created using these formal approaches. We will employ decision-procedure based test generation methods to generate high-quality tests for critical Uintah components.

Intellectual Merit: A formal comparison of the Uintah framework against other task-graph based representations meant for smaller-scale systems will be obtained. We will focus on Uintah-specific details such as dependency graph maintenance and data warehouse usage. We will develop a formal understanding of the quantities to monitor in such a problem solving implementation. We will judiciously develop differential verification approaches that focus verification effort on recent changes. We will make the Uintah components more robust by testing them using tests based on concrete plus symbolic (concolic) execution, and help to develop new, more robust components.

Broader Impact: This project will allow the most appropriate set of formal methods to be developed for use in large-scale problem solving environments that are widely researched. We will impact practitioners through the release of well-documented formal analysis tool suites demonstrated in the context of Uintah. This will positively impact our national strategic HPC capabilities, whose current vision is to achieve exascale in a decade. Our project will also help make much-needed revisions in the CS curricula, allowing multicore hardware and software solutions to be taught in concert. Our efforts to recruit and mentor students from minority and underrepresented groups will further broaden impact.

Keywords: Formal verification, MPI, high-performance computing, asynchronous task-graph based specification of programs.


B Table of Contents

A SHF: Medium: Formal Reliability Enhancement Methods for Million Core Computational Frameworks
B Table of Contents
C Project Description
   C.1 Why not just MPI? Why Prefer Component-based Architectures?
       C.1.1 Why Asynchronous Task Graphs?
   C.2 Proposal: Formal Reliability Enhancement of Computational Frameworks
       C.2.1 Uintah
       C.2.2 Formal Methods
       C.2.3 Formal Operational Semantics
       C.2.4 Dynamic Verification
       C.2.5 Symbolic and Concolic Verification
       C.2.6 Runtime Monitoring and Verification
       C.2.7 Formal Design for Large Core Counts
   C.3 Detailed Four-Year Agenda of Proposed Work, Deliverables
       C.3.1 Year-1: Formalization of the Existing Uintah
       C.3.2 Year-2: Rigorous Verification of Uintah Components
       C.3.3 Year-3: Rigorous Runtime Monitoring
       C.3.4 Year-4: Interplay Between Correctness and Performance Verification
   C.4 Outreach, Dissemination, Student (incl. under-represented) Impact
   C.5 Selected Technical Details
       C.5.1 Feasibility Assessment
       C.5.2 Specific Formal Artifacts and Formal Analysis Tools to be Developed
   C.6 Results from Prior NSF Support
   C.7 Collaboration Plans
       C.7.1 Roles for Project Participants
       C.7.2 Project Management
       C.7.3 Coordination Mechanisms for Scientific Integration
       C.7.4 Budget Support for Collaboration and Coordination
D References Cited
E Project Personnel
F Organizations Involved in the Project
G Collaborators
H Letters of Commitment


C Project Description

High performance computing is of strategic importance for every nation, providing the basic ability to conduct scientific research on an ever-increasing scale. To accelerate the rate of discoveries in science, medicine, and engineering, researchers are making use of the largest computers available today and need further advances in our HPC capabilities. As a concrete example, in studying ribosomes, simulating how 2 million constituent atoms interact over 2 nanoseconds takes about 290 days of simulation on a supercomputer [1].

The design of present and future HPC systems has been severely impacted by the cessation of improvements in processor clock speeds. While the move to multicore and manycore architectures is seen as the only viable option to achieve higher levels of energy efficiency and multi-petaflop performance, the path ahead is fraught with challenges. Many organizations and individuals are planning improvements of several orders of magnitude even with respect to state-of-the-art machines such as the Japanese K machine [161], other systems (e.g., Sequoia [3]), and accelerator-based systems such as NSF's Stampede and DOE's Titan that are to be delivered soon. These innovations will involve hybrid types of CPU cores and hybrid models of concurrency, as made clear in the Borkar-Chien article [35]. This move to architectures with possibly many millions of cores is going to severely challenge our ability to debug the support software and application software involved in HPC systems, as made clear in the National Academy study "The Future of Computing Performance: Game Over or Next Level?" [36].

The 2009 Exascale Software Study [4] describes the challenges of using hybrid compute platforms and hybrid software libraries. It asserts that handling a plethora of such components in a seamless way, and allowing programmers to pursue efficiency while still providing multiple safety nets, are all open challenges needing the use of formal methods. However, the use of formal methods after all the design/coding has happened is completely untenable at the scale of a million cores. Formal methods must be applied during design. This is the main thrust of this proposal.

In future systems built around heterogeneous core and memory types, verification must exploit system design hierarchy. First, with respect to a given abstraction level, we must verify component interactions for safety and forward progress. Next, we must verify that the abstractions themselves are correctly implemented. This is a crucial point because future large-scale systems will employ many such abstraction layers. Given the wide latency variations at various system interfaces, the use of concurrency of various flavors (fine-grained threading, streaming, and pipelining) will increase, supported by their own APIs. Unfortunately, this is where verification methods will also face their biggest challenges, unless they are properly designed: interactions between API calls of various types can quickly give rise to extremely large state spaces. Verification tools must help collapse state spaces into much smaller equivalence classes. The abstractions must also be designed through suitable component architectures so that the state spaces on either side of the abstraction barrier do not interact.

Our proposed work will occur in the context of Uintah, a computational framework under development at Utah which follows the above-described type of component architecture. In the proposed work, we will develop formal methods that can be used during the development of future Uintah versions. With the pace of hardware development anticipated in manycore computing, and with the strong push to show scalability results with respect to cutting-edge hardware, even within the four-year span of this project we will be attempting to port Uintah across multiple multicore CPU and GPU/accelerator types.

The Uintah software was originally written as part of the University of Utah Center for the Simulation of Accidental Fires and Explosions (C-SAFE) [174]. Uintah consists of a set of parallel software components and libraries that facilitate the solution of fluid-structure interaction problems modeled by partial differential equations (PDEs) on structured adaptive mesh refinement (SAMR) grids. Uintah presently contains four main simulation algorithms, or components: 1) the ICE [177, 178] compressible multi-material finite-volume CFD component, 2) the particle-based Material Point Method (MPM) [192] for structural mechanics, 3) the combined fluid-structure interaction (FSI) algorithm MPM-ICE [171, 170], and 4) the Arches turbulent reacting CFD component [191]. Uintah has been used for NSF-funded computer science projects, as well as other NIH, DARPA, DOD, and DOE funded projects (such as angiogenesis, tissue engineering, heart injury modeling, and blast-wave simulation), both at Utah and at other institutions. Uintah is a task-based asynchronous framework in which communication is overlapped with computation, tasks are executed out of order, and use is made of multicore architectures with a hybrid MPI-threads model [187, 166]. Uintah was


originally capable of running on 4K processors; it has since been released as open-source software [173] (see http://www.uintah.utah.edu) and extended to run on as many as 196K cores on both NSF and DOE parallel computers (Ranger, Jaguar, etc.) in a scalable way on challenging engineering examples, through NSF SDCI and PetaApps funding. By demonstrating our ideas on a mature computational framework such as Uintah, we will be challenged to adopt only the most practical of formal methods, and to ensure that they work at scale.

Uintah is currently being re-architected to scale into the multi-million core regime, in which it will be necessary to employ a hybrid shared-memory/message-passing implementation including traditional message passing and shared-memory central processing units (CPUs) as well as accelerators such as graphical processing units (GPUs) and the Intel MIC chip [5]. This development is exactly in line with what it will take to achieve extreme-scale computing: a thousand-fold increase in parallelism and a corresponding increase in core heterogeneity. The collaboration between the PI and co-PI on this project is most opportune in that the formal methods team will participate with the computational scientists in this process of re-architecting Uintah, and will develop, implement, and evaluate formal reliability-enhancing methods that will keep Uintah robust over the coming decade. At the same time, the formal methods group will develop a methodology that will be applicable to other codes that aim to use such large core counts.

Our attack will be along three axes: (i) Formal Design: How do we lay out a formal blueprint that will guide system evolution and also form the basis for specific correctness verification activities? (ii) Formal Verification: Which components of a computational framework can be formally verified or formally tested to minimize their chances of misbehavior during operation? (iii) Formal Monitoring: Given that verification is never perfect (e.g., overlooked assumptions) and is time-bounded, how do we support formal monitoring of system operation so that the bug depth (the cognitive and temporal separation from a bug's observation to its root cause) is reduced? Our aim is to test the formal design process by its use in improving existing components.

Through our work, we will obtain a formal comparison of the Uintah framework (with its multiple scheduler implementations, including message-passing based and hybrid shared-memory/message-passing based) against other task-graph based representations meant for smaller-scale systems. We will focus on Uintah-specific details such as dependency graph maintenance and data warehouse usage. We will develop a formal understanding of the quantities to monitor in such a problem solving implementation (e.g., task-graph dependencies, the data warehouse, the load balancer, etc.) and select those that facilitate differential verification the most. Differential verification helps manage verification complexity by focusing verification effort on recent design/code changes. We will make the Uintah components more robust by testing them using tests based on concrete plus symbolic (concolic) execution, and help to develop new, more robust components.

C.1 Why not just MPI? Why Prefer Component-based Architectures?

Over the last 15 years, the predominant library used in parallel computing has been the Message Passing Interface (MPI [104]) API. Many users have organized their programs as variants of the bulk synchronous parallel (BSP [6]) model, in which the data set being operated on is divided among the processors, which are responsible for computing local data updates and use MPI calls to exchange data, perform global reductions, and so on. Going into the future, one is faced with the enormous complexity of computational problems, which typically will involve multi-physics codes where multiple interacting problems have to be solved. Many large-scale scientific computation frameworks also manage tasks such as load balancing and checkpointing, and use adaptive mesh refinement to dynamically adapt the spatial mesh to achieve higher solution accuracy where needed. While great care is often taken in writing such large-scale MPI programs, the sheer scale of both the codes and the problems they solve is often daunting from an understanding, maintenance, and debugging point of view. This has been the chief impetus behind the development of modern computational frameworks that have a much higher degree of modularity and component-based architectures.
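For concreteness, the following minimal sketch (illustrative only; not taken from Uintah or any cited code) shows the BSP shape described above: a local update, a neighbor exchange at the superstep boundary, and a global reduction.

#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = rank + 1.0;  // stand-in for a locally computed data update
    double ghost = 0.0;
    int right = (rank + 1) % size, left = (rank - 1 + size) % size;

    // Superstep boundary: exchange one boundary value with the neighbors
    // (real codes exchange whole ghost-cell layers).
    MPI_Sendrecv(&local, 1, MPI_DOUBLE, right, 0,
                 &ghost, 1, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    double sum = 0.0;  // global reduction across all ranks
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) std::printf("global sum = %g\n", sum);

    MPI_Finalize();
    return 0;
}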

There is another, fundamentally different and equally important reason for preferring computational frameworks that involve more than just MPI. In the context of Uintah, we have shown that with an MPI-only approach, memory use associated with ghost cells and global meta-data becomes a barrier to scalability beyond O(100K) cores. This limitation has recently been overcome in Uintah through the use of hybrid parallelism, significantly reducing the per-core memory footprint of the code [187]. MPI-only codes also


cannot easily share data between processes without the overhead of message passing. The computational model thus inherently does not match the hardware organization of commodity manycore CPUs, with locally shared memory between the multiple cores on a node. One is therefore forced to employ multicore (e.g., threads, OpenMP, or TBB) and/or accelerator-specific languages (e.g., CUDA and OpenCL) in conjunction with MPI. Unfortunately, directly mixing multiple concurrency APIs ("hybrid concurrency" without adequate structure) results in code units that have no formal semantic characterization, potentially leading to codes that are not portable and/or contain various concurrency bugs [34]. Even in this situation, code organization around components comes to the rescue. It allows us to clearly delineate the scope of API interactions, which are now purely in terms of higher-level component API calls. The role of MPI and other concurrency libraries is then to implement the component API semantics. Many of these MPI and concurrency library calls can be (and, within Uintah, are being) automatically generated, thus minimizing the chances of introducing errors and avoiding tedium.

Collaborations: PI Gopalakrishnan brings to this project his experience developing formal approaches for large-scale MPI programs. His work is published in a forthcoming CACM article [64] co-authored with other lead university researchers in HPC as well as researchers from LLNL and ANL. He presented a half-day Supercomputing 2010 tutorial [7], and will present a full-day Supercomputing 2011 tutorial [9], both on formal and semi-formal verification methods for MPI programs. The 2011 tutorial is in partnership with researchers from Lawrence Livermore National Laboratory, the University of Dresden, and Allinea Inc., a major commercial company specializing in large-scale parallel application debugging tools and led by a CTO with considerable formal methods background. Co-PI Berzins brings to this project his experience in HPC through C-SAFE and its extensions, and has been instrumental in leading the effort that took Uintah from 4K cores to 196K cores, both in the development of new algorithms and in software. This project also involves Humphrey, a former research student of the PI in formal methods, who now works as a software engineer in the Uintah project under Berzins. We will also hire a post-doctoral fellow who specializes in applied formal methods, one student in formal methods under the supervision of Gopalakrishnan and this postdoc, and another student under the supervision of Berzins and Humphrey to work on future Uintah code architectures and implementations and to create examples of new, more scalable components to use as benchmarks.

Our team includes personnel who can work on all aspects of making the Uintah project a pioneer in demonstrating the positive role that formal methods can play in building the computational frameworks of the coming decade. We detail our milestones and student/postdoc mentoring plans in later sections.

C.1.1 Why Asynchronous Task Graphs?

We are interested in computational frameworks that are typically created around higher-level notions such as asynchronous task graphs, roughly following the Silver model [13]. There is a rich legacy of work in this area, including the SMARTS dataflow engine that underlies POOMA [12], the Cactus framework [14], SISAL [15], Charm++ [16], TBLAS [17], Scioto [18], Uintah [188], and Concurrent Collections [19]. Asynchronous task graphs provide a natural and efficient way of exposing parallelism and managing computation, and are also a mechanism for scheduling computation more efficiently. While our research will impact task-graph-based frameworks, it will also be applicable to other framework approaches designed for exascale architectures, provided that they are component-based. In addition, the broad use of frameworks such as Uintah (e.g., [169, 185, 194, 195]) within the NSF, DOE, and DOD communities will also involve us in a vibrant research community that will care about and utilize the solutions we develop [162].

C.2 Proposal: Formal Reliability Enhancement of Computational Frameworks

We now provide background sufficient to understand the yearly work plans of our proposal. While describing the background, we will also provide brief descriptions of new work to be done (summarized in Figure 1). Two students will work on two complementary but different aspects of the project. One will work on the formal methods development for the existing code, while the other will use the approaches developed by the first to explore the extension and development of key components, such as the load balancer, in a proof of concept of the formal design process.


Figure 1: Four-Year Work Plans.
Y1: Focal item: formalize Uintah subsystems as task graphs. Code impact: requirements for new load balancing; upgrade the Uintah implementation with annotations. Science impact: aspect-oriented programming for task graphs; formal operational semantics for hierarchical task graphs.
Y2: Focal item: FV for Uintah components. Code impact: formal design of new load balancing; apply FV and integrate back hardened components. Science impact: formal approaches for large-scale concolic and dynamic verification.
Y3: Focal item: formal runtime monitoring methods. Code impact: performance evaluation of load balancing; generate runtime monitoring code and integrate it into Uintah. Science impact: algorithms to compile operational semantics and aspects into runtime assertions.
Y4: Focal item: correctness/performance interplay. Code impact: demonstration of large-scale runs; generate performance calibration tests. Science impact: methodology for debugging performance and correctness.

A two-year postdoc will help achieve significant progress during Y1 and Y2; we will attempt to retain the postdoc through additional funding beyond Y2. These work items are elaborated in the form of a four-year research agenda in § C.3.

C.2.1 Uintah

Figure 2: Architecture of Uintah hybrid task scheduling system

Component Architecture and Task-Graph Based Execution: The aim of Uintah was to be able to solve complex multiscale multi-physics problems, such as the benchmark C-SAFE problem. This is a multi-physics, large-deformation, fluid-structure problem consisting of a small cylindrical steel container filled with a plastic-bonded explosive (PBX9501) subjected to convective and radiative heat fluxes from a fire [170]. In order to solve many such complex multiscale multi-physics problems [175, 172, 179, 167],


Uintah makes use of a component design that enforces separation between large entities of software that can be swapped in and out, allowing them to be independently developed and tested within the entire framework.

Uintah is novel in its use of a task-based paradigm, with complete isolation of the user from parallelism. The individual tasks are viewed as part of a directed acyclic graph (DAG) and are executed adaptively, asynchronously, and now often out of order [186]. This is done by utilizing an abstract task-graph representation of parallel computation and communication to express data dependencies between components. The task-graph is a directed acyclic graph of tasks. Each task consumes some input and produces some output (which is in turn the input of some future task). These inputs and outputs are specified for each patch in a structured adaptive mesh refinement (AMR) grid. Adaptive structured meshes consist of hexahedral patches, often of 15³ elements [57], that hold the problem data. Associated with each task is a C++ method which is used to perform the actual computation. Each component specifies a list of tasks to be performed and the data dependencies between them. The task-graph approach of Uintah shares many features with the migratable object philosophy of Charm++ [176]. In order to increase efficiency, the task graph is created and stored locally [165].

New Work: We will develop formal approaches to describe all of Uintah's subsystems (users' computations, scheduler, load balancer, AMR modules) uniformly using task graphs. At present, only the user computational tasks are described explicitly inside Uintah as task graphs. By formalizing the essential interface aspects of Uintah subsystems as task graphs during Year-1 (Figure 1), everyone involved (students, postdocs, and staff member Humphrey) will obtain a common as well as formal understanding of the existing Uintah system. The primary motivation for this holistic task-graph based description is to have a shared understanding of the emergent behavior of the entire Uintah system when all components interact, a fact that is never documented in most complex systems. We will document in our description those details specific to various subsystem implementation versions versus those details that are expected to remain invariant across future Uintah code evolutions and implementations. With this blueprint available, interior "firewalls" within Uintah can be erected, supported by run-time monitoring methods.
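The requires/computes task model just described can be made concrete with a small, self-contained sketch. The names and types below are illustrative stand-ins, not Uintah's actual API; a simple topological sweep stands in for the real scheduler.

#include <cstdio>
#include <functional>
#include <set>
#include <string>
#include <vector>

struct Task {
    std::string name;
    std::set<std::string> needs;     // Uintah's "requires": inputs (plus a stencil width)
    std::set<std::string> provides;  // Uintah's "computes": outputs this task produces
    std::function<void()> body;      // the C++ method performing the actual computation
};

int main() {
    std::vector<Task> graph = {
        {"interpolate", {"particles"},    {"gridVelocity"}, [] { std::puts("interpolate"); }},
        {"solve",       {"gridVelocity"}, {"gridAccel"},    [] { std::puts("solve"); }},
        {"update",      {"gridAccel"},    {"newParticles"}, [] { std::puts("update"); }},
    };
    std::set<std::string> produced = {"particles"};  // initial warehouse contents
    std::set<std::string> done;

    // Topological sweep over the DAG: a task is ready once every variable it
    // needs has been produced by an earlier task (or existed initially).
    bool progress = true;
    while (progress) {
        progress = false;
        for (auto& t : graph) {
            if (done.count(t.name)) continue;
            bool ready = true;
            for (const auto& r : t.needs) ready = ready && produced.count(r) > 0;
            if (!ready) continue;
            t.body();
            produced.insert(t.provides.begin(), t.provides.end());
            done.insert(t.name);
            progress = true;
        }
    }
    return 0;
}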

There is a very strong reason why such firewalls must be part of every computational framework. Simulation users will want to run their simulations on the latest hardware, and with respect to nascent Uintah code-bases that may crash. When challenged to understand these crashes, Uintah developers rely on tools such as DDT [20] available on machines such as Jaguar [21]. While debuggers are good at reading core dumps and allowing users to step through traces, they are often not easy to use at scale in contexts involving multiple APIs, where the core-dump sizes are large and the bug depth is large. They are often unable to provide intuitive reasons as to why things failed. In contrast, our "formal firewalls" are meant to operate at a much higher level than core dumps, and will be designed to keep track of causal chains spanning multiple higher-level activities, vastly reducing debugging effort. This can cut down the time delay from code development to achieving stable long-running simulations, improving the overall utilization of expensive HPC resources.

A strong scientific outcome will be a formal understanding of how to weave various task graphs using new flavors of aspect-oriented programming [22] suitable for cross-cutting issues pertaining to task graphs, something we have not come across before. We will publish on what worked and what did not, to the benefit of the larger community engaged in building computational frameworks.

Task Scheduler: Uintah’s task scheduler is responsible for computing the dependencies of tasks, de-termining the order of execution and ensuring that the correct inter-process communication is performed,[165]. In this mapping communication is automatically overlapped with computation. Originally, Uintahused a static scheduler in which tasks were executed in a pre-determined order. This caused delays whena single task was waiting for a message. The new Uintah dynamic scheduler changes the task order duringthe execution to ensure that processors do not have to wait, [186]. This scheduler required a large amountof development to support the out-of-order execution, which produced a significant performance benefit inlowering both the MPI wait time and the overall runtime. The dynamic scheduler utilizes two task queues:an internal ready queue and an external ready queue. If a task’s internal dependencies are satisfied, thenthat task will be put in the internal ready queue where they will wait until all required MPI communicationhas finished. As long as the external queue is not empty, the processor always has tasks to run. This canhelp to overlap the MPI communication time with task execution. This approach reduces MPI wait times

Page C-5

Page 9: SHF: Medium: Formal Reliability Enhancement Methods for ... · of large simulation frameworks that solve complex multi-physics problems on hundreds of thousands of cores today and

PROJECT DESCRIPTION

significantly, as shown in [165, 186].New Work: We will focus on formally understanding the spectrum of task schedulers already created forUintah: the original serial scheduler that synthesizes a static schedule; the dynamic out-of-order schedulerthat can fill otherwise unused scheduling slots; and the most recent multicore scheduler that can employthreads. We will develop formal verification methods for these schedulers, and also make our methodscapable of handling future variants of schedulers. These schedulers may, for instance, be designed toaccommodate future many-cores and accelerators such as GPUs, as well as future versions of MPI (e.g.,MPI 3.0 [23]) that will allow non-blocking collectives. We will also be developing formal monitoring methodsto ensure the correctness of the schedules generated. For example, we already have managed to extract fromthe present Uintah task graph the following concerns:• The scheduler is invoked per time-step.• Tasks move from the internal to the external ready queue (Figure 2) in response to MPI messages• MPI message receives are done by one controller thread; however the sends are done by multiple

threads. Therefore, Uintah depends on its underlying MPI runtime providing the ability for multiplethreads to be making MPI calls. It is important for Uintah to make sure that this is true before itruns.

• As to whether the MPI libraries themselves are correct, we rely on the kind of formal methods researchsuch as we have done in the past [121]. The PI has discussed verification of MPI thread-multiplesupport, and the algorithms that this involves, with the authors of [24]. We are well positioned todevelop a battery of formal tests to thoroughly test MPI library implementations before Uintah runsuse them.

• We already have extracted a finite-state machine describing legal state transitions expected to occurwithin a significant class of multithreaded Uintah schedulers. These state machines will be describedat a high level and translated into actual runtime monitoring code [25].
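As a minimal sketch of the two-queue discipline described above (illustrative types, with a loop simulating the controller thread; this is not Uintah's code), consider:

#include <cstdio>
#include <deque>
#include <string>
#include <vector>

struct Task {
    std::string name;
    int pendingReceives;  // outstanding MPI receives feeding this task
};

int main() {
    // All internal dependencies are already satisfied; receives are not.
    std::vector<Task> tasks = {{"A", 2}, {"B", 0}, {"C", 1}};
    std::deque<Task*> internalQ, externalQ;
    for (auto& t : tasks) internalQ.push_back(&t);

    while (!internalQ.empty() || !externalQ.empty()) {
        // Controller thread's role: promote tasks whose communication is done.
        for (auto it = internalQ.begin(); it != internalQ.end();) {
            if ((*it)->pendingReceives == 0) {
                externalQ.push_back(*it);
                it = internalQ.erase(it);
            } else {
                ++it;
            }
        }
        // Worker threads' role: anything in the external queue is safe to run.
        while (!externalQ.empty()) {
            std::printf("running task %s\n", externalQ.front()->name.c_str());
            externalQ.pop_front();
        }
        // Simulate one MPI message arriving for the longest-waiting task.
        if (!internalQ.empty()) --internalQ.front()->pendingReceives;
    }
    // Prints B, then A, then C: execution follows message arrival, not
    // declaration order -- the out-of-order behavior described above.
    return 0;
}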

Data Warehouse: The core scheduler component stores simulation variables in a data warehouse. The data warehouse is a dictionary-based hash map which maps a variable name and patch id to the memory address of a variable. Each task can get its read and write variable memory by querying the data warehouse with a variable name and a patch id. The task dependencies of the task graph guarantee that there are no memory conflicts on local variable accesses, while variable versioning guarantees that there are no memory conflicts on foreign variable accesses. These mechanisms were implemented to support out-of-order task execution in our previous work using a dynamic MPI scheduler [186]. This means that a task's variable memory has already been isolated; hence, no locks are needed for reads and writes on a task's variable memory. However, the dictionary data itself still needs to be protected when a new variable is created or an old variable is no longer needed by other tasks. As the dictionary data must be consistent across the worker threads, the data warehouse has to be made thread-safe by the addition of read-only and read-write locks. When a task needs to query the memory position of a variable, a read-only lock must be acquired before this operation is done. When a task needs the data warehouse to allocate a new variable, or to clean up an old variable, a read-write lock must be acquired before this operation is done. While this increases the overhead of multi-threaded scheduling, locking the dictionary data is still more efficient than locking all the variables.

New Work: The formal design of thread-safe data warehouses will be a focal item, requiring formal verification support. We will also be developing formal monitoring methods that keep track of the integrity of the data warehouse. We now list some of the correctness concerns we have encountered and plan to formally treat (a small sketch of the locking discipline follows this list):
• On certain operating systems, Pthreads does not allow a read lock to be promoted to a read-write lock without first releasing the read lock. This creates a race vulnerability window when no lock is being held. Luckily, the Uintah designers were aware of this situation and have correctly handled it; however, these are precisely the types of issues that can result in nasty bugs (similar in complexity to the Photoshop bug [26] that took nearly a decade to resolve). We will develop methods to formally verify data warehouse implementation routines against data races and violated atomicity.
• We plan to have multiple versions of data warehouses, some designed for multicore CPU usage and others for GPU/accelerator usage. While the techniques for efficient memory management used in these implementation types are different, the external logic of the data warehouses will be the same at a high level. These are good opportunities to apply differential verification [27], which focuses verification effort around specific code changes.
• Our present GPU-based implementation marshals data into a flat structure and employs calls such as cudaMallocPitch() to reduce non-coalesced memory accesses. We will extend our concolic verification methods (to be discussed around Figure 3 (right)) to model the semantics of these calls and correctly analyze GPU/accelerator code for safety and performance.
• We will also research possible alternative mechanisms for GPU data accesses within tasks, as studied in CnC-CUDA [28].
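A minimal sketch of the read-only/read-write locking discipline described above, using C++17's std::shared_mutex in place of the Pthreads locks mentioned in the text (illustrative only; not Uintah's implementation):

#include <cstdio>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>
#include <vector>

class DataWarehouse {
    // Dictionary mapping (variable name, patch id) to the variable's storage.
    std::map<std::pair<std::string, int>, std::vector<double>> vars_;
    std::shared_mutex lock_;

public:
    // Tasks query for their pre-isolated memory: a read-only lock suffices,
    // so concurrent lookups by worker threads do not serialize.
    std::vector<double>* get(const std::string& name, int patch) {
        std::shared_lock<std::shared_mutex> readLock(lock_);
        auto it = vars_.find({name, patch});
        return it == vars_.end() ? nullptr : &it->second;
    }
    // Creating a variable mutates the dictionary: a read-write lock is needed.
    // Note the absence of lock "promotion": a caller holding a read lock must
    // release it first, opening exactly the race window discussed above.
    std::vector<double>& allocate(const std::string& name, int patch, size_t n) {
        std::unique_lock<std::shared_mutex> writeLock(lock_);
        return vars_[{name, patch}] = std::vector<double>(n);
    }
    // Cleaning up a variable no other task needs also mutates the dictionary.
    void cleanup(const std::string& name, int patch) {
        std::unique_lock<std::shared_mutex> writeLock(lock_);
        vars_.erase({name, patch});
    }
};

int main() {
    DataWarehouse dw;
    dw.allocate("pressure", /*patch=*/0, 8);
    if (auto* v = dw.get("pressure", 0)) std::printf("found %zu cells\n", v->size());
    dw.cleanup("pressure", 0);
    return 0;
}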

C.2.2 Formal Methods

Formal methods span the entire spectrum of rigorous methods pertaining to the functional and performance correctness of complex systems. Historically, the predominant slant in academic formal methods research has been to address correctness (e.g., does the code compute the expected answers?) or to detect symptoms suggesting that correctness may be compromised (e.g., does the code involve a data race, or is it behaving non-deterministically in a case where determinism was required?). Modern formal methods address timing and performance in various ways, e.g., through various timed automaton formalisms [29], in terms of task-graph scheduling algorithms [30], etc. It is clear that performance is the harder of the two objectives because, unlike correctness, which tends to be defined largely by the actual code, performance is a complex function of how code interacts with the entire machine and its memory, network, and IO subsystems. Yet, it is crucial that formal methods researchers provide a handle on performance estimation, especially if it is obtainable as a natural byproduct of their correctness analysis framework. We have already demonstrated such a capability in our past work [89], where we showed that the same kinds of symbolic representations that help answer GPU code correctness questions can also help estimate the degree of memory bank conflicts. This proposal will weave in practical approaches to performance alongside concurrency checking. We now list some of the focal tasks in the proposed work.

C.2.3 Formal Operational Semantics

A formal approach to describing system behavior is through operational semantics. In previous work, we have shown that this approach is practical for real-world APIs: in [93], we provide a detailed formal operational semantics covering 150 of the 320 MPI 2.0 standard functions in TLA+. Others have similarly specified the semantics of featherweight CnC [94] and the Windows API [95], for example.

New Work: The proposed work will describe Uintah's internal operations through a formal operational semantics (§ C.3). This will serve as the basis for developing formal verification and formal monitoring approaches, and will also contribute to science by rigorously comparing our work against other task-graph based approaches.


C.2.4 Dynamic Verification

Dynamic verification methods are popular in contexts where much of the overall execution semantics is governed by API and runtime semantics, i.e., things outside the actual user source code. Examples include dynamic formal verification approaches deployed to handle codes that interact with C libraries (the VeriSoft [59] tool), the Windows API (the CHESS [106] tool), the Java library (the Java PathFinder [69] tool), or distributed systems (the MODist [70] tool). By instrumenting API calls, these tools are able to momentarily transfer control to a "verification scheduler" that can examine the interleaving being elaborated, and prune or otherwise alter it in order to eliminate redundant (equivalent) searches or heuristically control search complexity. In a recent effort [150], we showed that these approaches can also be implemented in a distributed manner, and elaborated the design and results of our tool, built collaboratively with LLNL, called DAMPI (distributed analyzer of MPI; see Figure 3 (left)). DAMPI is the first MPI-program dynamic analyzer that can formally analyze the MPI communication non-determinism space of unmodified 1000-process MPI C or Fortran applications, and it runs on the actual cluster machine where the final application will be run.

Figure 3: The DAMPI Framework (left) and the GKLEE Framework (right)

New Work: Frameworks such as DAMPI are designed to control the execution of programs when viewed from the vantage point of MPI calls. Unfortunately, this is too low-level a vantage point, which causes dynamic verification tools to be far less effective in interleaving reduction. The overall semantics of computational frameworks are dictated by the much higher-level semantics of component-level API calls. In this work, we will develop scalable dynamic verification methods for computational framework components. We will exploit the higher-level component API call semantics to prune redundant interleavings much more effectively.
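For reference, the standard mechanism such tools use to interpose on MPI is the PMPI profiling interface; the sketch below (illustrative, and far simpler than DAMPI's PnMPI-based modules) wraps MPI_Send so that a tool gains control at each call site:

#include <cstdio>
#include <mpi.h>

// Every MPI function has a PMPI_-prefixed twin; defining our own MPI_Send
// (MPI-3 signature shown) lets a tool observe, and in a verifier delay or
// reorder, each operation before forwarding it to the real implementation.
extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype datatype,
                        int dest, int tag, MPI_Comm comm) {
    int rank;
    PMPI_Comm_rank(comm, &rank);
    // A verification scheduler would take control here: record the event,
    // consult its interleaving-exploration strategy, then decide when to fire.
    std::fprintf(stderr, "[rank %d] MPI_Send -> %d (tag %d)\n", rank, dest, tag);
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}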

C.2.5 Symbolic and Concolic Verification

Symbolic analysis of programs is one of the fastest growing areas of formal methods. This is in direct response to the fact that the underlying constraint-solving engines (known as SMT [90] solvers), which are decision procedures for decidable fragments of first-order logic, have improved in performance by several orders of magnitude. Some of the many achievements using modern SMT solvers are the following: (i) finding attacks (bad test inputs or exploits) on browsers and parsers [42]; (ii) rigorous testing of software modules [79, 87, 44], with some of the efforts, such as KLEE [39] and EXE [41], finding adoption and extension in at least seven other tools [40]. Our contributions to this area have come in the form of two tools, PUG [89] and GKLEE [31], both capable of finding bugs in practical GPU codes, including CUDA SDK code modules exceeding a few thousand lines of code. The workflow used in these tools, concolic execution, is illustrated in Figure 3 (right).

New Work: In this proposal, we will develop rigorous concolic-execution based verification methods for Uintah components, developing heuristics, problem encoding techniques, and specialized SMT-solving techniques to handle the scale of the problems. One specific problem was already mentioned: to apply concolic testing to Uintah components while correctly modeling special-purpose library calls. We will extend our tool GKLEE for this purpose, also extending it to model concurrency libraries.
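The concolic workflow can be illustrated in miniature: execute concretely, record the path condition over a symbolic input, then negate a branch and solve for an input that drives the other path. In the hypothetical sketch below, a brute-force search stands in for the SMT solver that real tools such as GKLEE invoke:

#include <cstdio>
#include <functional>
#include <vector>

struct Branch { std::function<bool(int)> pred; bool taken; };

// Unit under test, instrumented to log every branch on the symbolic input x.
int underTest(int x, std::vector<Branch>& path) {
    bool b1 = x > 10;
    path.push_back({[](int v) { return v > 10; }, b1});
    if (!b1) return 0;
    bool b2 = x % 2 == 0;
    path.push_back({[](int v) { return v % 2 == 0; }, b2});
    return b2 ? 2 : 1;
}

// Stand-in "solver": find an input satisfying the path-condition prefix with
// the final branch negated (a real tool hands this query to an SMT solver).
bool solveFlipped(const std::vector<Branch>& path, int& out) {
    for (int v = -100; v <= 100; ++v) {
        bool ok = true;
        for (size_t i = 0; i + 1 < path.size(); ++i)
            ok = ok && path[i].pred(v) == path[i].taken;
        if (ok && path.back().pred(v) != path.back().taken) { out = v; return true; }
    }
    return false;
}

int main() {
    std::vector<Branch> path;
    underTest(0, path);      // seed run covers the x <= 10 path
    int x2 = 0;
    solveFlipped(path, x2);  // yields 11: covers (x > 10, odd)
    path.clear();
    underTest(x2, path);
    int x3 = 0;
    solveFlipped(path, x3);  // yields 12: covers (x > 10, even)
    std::printf("generated tests: 0, %d, %d\n", x2, x3);
    return 0;
}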

C.2.6 Runtime Monitoring and Verification

We will develop a formal understanding of how a system of task graphs can be specified hierarchically, and of how to define their joint behavior. The process of obtaining these task graphs can only be partially automated. For example, each Uintah component decomposes the steps of an algorithm and the data dependencies between those steps into tasks. Each task defines a single step of the algorithm along with the necessary dependencies. The dependencies describe both the inputs and outputs of each task. The inputs of a task are referred to as requires and specify which variables (along with a stencil width) the task needs in order to execute. The outputs of a task are referred to as computes and specify which variables the task produces.

New Work: In the proposed work, we will create incentives for the designers of every new Uintah component or subsystem to provide an auxiliary layer of specification characterizing the component. The main incentive would, of course, be the rewards they reap by either catching more bugs or being able to better understand


evolving designs. The importance of describing these "cross interactions" becomes clear by focusing on two scenarios:

AMR-time Interactions: Whenever the simulation grid changes, a series of steps must be taken prior to continuing the simulation. First, the patches must be distributed evenly across processors through load balancing. Second, all patches must be populated with data, either by copying from the same region of the previous grid or by interpolating from a coarser region of the previous grid. Finally, the task graph must be recompiled [180, 183].

Load-balancing Interactions: With particle-based methods, the workload can change on each time step as particles move throughout the computational domain. This can necessitate that load balancing occur often, making it important that the time to generate the patch distribution is small relative to the overall computation. If a slow load balancing algorithm is used and load balancing occurs often, the time to load balance can dominate the overall runtime. In this case, it may be preferable to use a faster load balancing algorithm at the cost of more load imbalance. In addition, the time to migrate data between processors may also become significant when load balancing often [186].

We propose that these cross-couplings be captured using an aspect-oriented layer. We will research ways to weave cross-cutting concerns across task-graph based descriptions using a constraint specification layer. Working with Uintah developers who have a detailed understanding of the present system architecture (in some cases they have a six-year or longer association with the Uintah code), we will develop practical approaches for achieving this goal (ensuring that the annotation overhead is not too onerous), and we will field-test our ideas by applying them to the existing Uintah framework.
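As a hint of what such woven, cross-cutting constraints would compile down to, the sketch below hand-codes a runtime monitor for the AMR-time ordering described above (load balance, then data migration, then task-graph recompile after every regrid); the event names are illustrative, not Uintah's:

#include <cassert>
#include <cstdio>

enum class Event { Regrid, LoadBalance, MigrateData, Recompile, RunTimestep };

// After a Regrid, the next framework events must be LoadBalance, then
// MigrateData, then Recompile; other events are unconstrained in steady state.
class AmrOrderMonitor {
    enum class State { Steady, NeedLB, NeedMigrate, NeedRecompile } s_ = State::Steady;

public:
    void observe(Event e) {
        switch (s_) {
        case State::Steady:
            if (e == Event::Regrid) s_ = State::NeedLB;
            break;
        case State::NeedLB:
            assert(e == Event::LoadBalance && "regrid must be followed by load balancing");
            s_ = State::NeedMigrate;
            break;
        case State::NeedMigrate:
            assert(e == Event::MigrateData && "patches must be populated before recompiling");
            s_ = State::NeedRecompile;
            break;
        case State::NeedRecompile:
            assert(e == Event::Recompile && "the task graph must be recompiled after a regrid");
            s_ = State::Steady;
            break;
        }
    }
};

int main() {
    AmrOrderMonitor m;
    for (Event e : {Event::RunTimestep, Event::Regrid, Event::LoadBalance,
                    Event::MigrateData, Event::Recompile, Event::RunTimestep})
        m.observe(e);  // a violating order would trip the corresponding assertion
    std::puts("trace conforms to the AMR-time ordering constraint");
    return 0;
}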

C.2.7 Formal Design for Large Core Counts

A key validation of the above approach will be to use the underlying formal methodology to improve the performance of key components of the software. The present approach used to load balance the calculation is based upon a very fast space-filling curve algorithm that is not, however, distributed across all cores. At present, for meshes with fewer than one billion elements, this algorithm performs well and has a very low overhead [181]. In going to the next generation of problems and computers, this algorithm must be revised to work in a more distributed fashion across more cores. In modifying these algorithms we will adopt the formal approach outlined above at a very early stage to ensure, as far as possible, the generation of correct code. Similarly, in the load balancing phase it will be important to ensure that the load balancer works across heterogeneous architectures with both CPUs and GPUs/accelerators. Again, the existing methodology [183] works well, but will need to be extended to the next generation of both problems and architectures. In particular, this will need further work to ensure that the very different balance of tasks and processors is properly addressed. The formal monitoring approach will be used in the design of this improved load balancer. The final way in which we can use a formal approach is to redesign and, as much as possible, eliminate the very sparse global mesh data structure that is stored at present. Although the move to a multicore hybrid approach has reduced the impact of this data structure by a factor of ten [187], the move to even larger problems will make it desirable to move to a more hierarchical data structure that is more distributed and based more on a spatial mesh region being associated with a computationally intensive node. The design and testing of this component will again make use of the formal approach outlined above.
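To illustrate the space-filling-curve idea underlying such load balancers (a textbook Morton-order sketch, not the algorithm of [181]): patches are ordered along the curve and the curve is cut into equally weighted contiguous chunks, one per processor, so spatially nearby patches land on the same rank.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Interleave the low 10 bits of x, y, z into a 30-bit Morton (Z-order) key.
uint32_t morton3(uint32_t x, uint32_t y, uint32_t z) {
    auto spread = [](uint32_t v) {
        v &= 0x3ff;
        v = (v | (v << 16)) & 0xff0000ff;
        v = (v | (v << 8))  & 0x0300f00f;
        v = (v | (v << 4))  & 0x030c30c3;
        v = (v | (v << 2))  & 0x09249249;
        return v;
    };
    return spread(x) | (spread(y) << 1) | (spread(z) << 2);
}

struct Patch { uint32_t x, y, z; int owner = -1; };

int main() {
    std::vector<Patch> patches;
    for (uint32_t x = 0; x < 4; ++x)
        for (uint32_t y = 0; y < 4; ++y)
            for (uint32_t z = 0; z < 4; ++z) patches.push_back({x, y, z});

    // Sort patches along the curve, then cut it into equal contiguous chunks.
    std::sort(patches.begin(), patches.end(), [](const Patch& a, const Patch& b) {
        return morton3(a.x, a.y, a.z) < morton3(b.x, b.y, b.z);
    });
    const int nprocs = 4;  // real balancers weight patches by estimated cost
    for (size_t i = 0; i < patches.size(); ++i)
        patches[i].owner = static_cast<int>(i * nprocs / patches.size());

    std::printf("patch (%u,%u,%u) -> rank %d\n",
                patches[0].x, patches[0].y, patches[0].z, patches[0].owner);
    return 0;
}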

C.3 Detailed Four-Year Agenda of Proposed Work, Deliverables

The proposed work will occur over four years and will address the following activities, each described under its own section heading with a highlight of the work for that year (with the understanding that this work will also spread suitably over the other years).

C.3.1 Year-1: Formalization of the Existing Uintah

• Develop our hierarchical task-graph notation spanning all Uintah computations as well as users' task-graph based component code. Include both timing and performance aspects (§ C.2.1).


• Develop the aspect-oriented layer to formally compose aspects of Uintah components (§ C.2.1).
• Develop formal characterizations for Uintah schedulers.
• Study requirements for advanced load balancing of both the initial and existing meshes, as well as scalable mesh data structures.

C.3.2 Year-2: Rigorous Verification of Uintah Components

• Based on the formalization, we will have formal specifications for the Uintah components. As case studies, we will apply existing concolic testing approaches to threaded versions of the Uintah data warehouse, and improve the scalability of our concolic tools.

• We will examine Uintah schedulers using dynamic verification methods, again developing strengthened and domain-specialized dynamic verifiers.

• One of the desired properties of a well-designed component architecture is interchangeability of components. However, there is no formal methodology that we are aware of that ensures that a given swap is successful. Given that we have access to at least three Uintah schedulers, we will formulate and study this problem, developing checkable criteria that ensure interchangeability. This will be a crucial result with respect to our future evolution roadmap of Uintah, where multiple APIs, hardware platforms, and scheduler types will have to be considered in solving newer problem types and at different scales.

• Begin improvements to load balancing algorithms and mesh data that will distribute the computations locally and remove the need for mesh meta-data, using formal techniques to ensure correctness.

C.3.3 Year-3: Rigorous Runtime Monitoring

• We will develop formal run-time monitoring by following a compilation approach. Much like the present situation where the user-specified requires and computes are turned into a compiled schedule, we will develop compilation approaches that generate the run-time monitoring code from user specifications of the cross-coupling aspects that weave the different task graphs. Run-time verification is gaining importance as a practical way to deploy formal methods in the field, especially in the context of mission-critical software. A corresponding methodology in the HPC space will be a significant advance for future computational framework designs.

• We will investigate how to link the run-time monitoring frameworks we develop with existing debuggers such as DDT [20], MUST [11], and TotalView [10]. Our strong working relationships with the Eclipse Parallel Tools Platform (elaborated in § C.4) give us the necessary impetus in this direction.

• Use rigorous runtime monitoring to ensure that both the existing load balancer and new improved versions perform as expected in a scalable way on increasingly large core counts.

C.3.4 Year-4: Interplay Between Correctness and Performance Verification

• We will investigate symbolic methods for first-order estimators of code performance [8]. One direction is to develop static analysis methods that predict whether a unit of computation has the possibility of exhibiting anything but linear growth in computing time with respect to chosen problem parameters. A guiding philosophy behind all the scalability achievements of Uintah has been the manually enforced discipline of sticking to this linearity principle. We propose to automate this check using formal methods (a toy dynamic version of such a check is sketched after this list).

• As opposed to a static approach that may generate false alarms, we will investigate whether there are automatic ways to generate a suite of concolically generated tests against which to evaluate the timing profiles of Uintah components dynamically. This, again, is to detect lurking performance bugs early.

• We will integrate our timing/performance visualization methods into existing tool frameworks, again similar to our prior work with PTP described in § C.4.

• In our GKLEE tool, we have employed symbolic methods to predict three classes of GPU inefficiencies: (i) warp divergences, (ii) memory bank conflicts, and (iii) non-coalesced shared memory accesses. In fact, our tool can read in a CUDA Compute Capability description (whether a 1.x or a 2.x family of GPU cards is being programmed) and appropriately calculate these performance predictors. Our present approach suffers from the fact that our estimates are static (for instance, "the number of GPU kernel barrier intervals over which a performance problem is detected"). We will develop ways to turn this into a dynamic estimate by calibrating the severity of performance issues over user-given and/or automatically generated data sets.
• Demonstration of the large-scale performance of the verified approaches will be undertaken using NSF XSEDE large-scale parallel computing resources, e.g., the proposed 15-petaflop Intel MIC chip based Stampede machine at TACC in Austin that NSF announced in September 2011.

C.4 Outreach, Dissemination, Student (incl. under-represented) Impact

Student Mentees: The PI and co-PI make every effort to recruit minority students. The following undergraduates were supervised by the PI during the past five years (mainly through REUs): Steve Barrus (Intel), Joseph Mayo (Microsoft), Geof Sawaya (MS student), Michael DeLisi (Amazon), Alan Humphrey (Uintah/SCI staff), Grant Ayers, Chris Derrick (StorageCraft Inc.), Michael Bentley, Jason Williams (L3 Communications), Tyler Robinson, and Brandon Gibson. He conducted a Science Camp (http://www.cs.utah.edu/~ganesh/science06.html) for six female 7th-grade students, and mentored a female high-schooler, Cawthon:
• DeLisi authored 7 papers while a BS/MS student, was on the short list of Outstanding CRA CS UG Finalists (ranked in the top 6), and was voted Outstanding CE Student during 2009.
• Ayers won the Outstanding CE UG award during 2011.
• Humphrey won the Outstanding CS UG award during 2011. He was voted an official committer to the Eclipse Parallel Tools Platform (PTP) repository. He has made substantial contributions to PTP with regard to ISP/dynamic verification (see his [73], co-authored with Derrick). He helped the IBM PTP team during their Supercomputing tutorials in 2009 and 2010, and plans to interact with them during their 2011 tutorial also. He was the Outstanding SC Student Poster Runner-up during 2010.
• Gibson presented a student poster during TeraGrid 2011, and will do so during SC'11 also.
• Ellie Cawthon (now a freshman at Pomona) was mentored by Sawaya, doing an internship in formal methods/parallel programming (http://eparallel.blogspot.com/), having had no prior exposure to parallelism.
• Both Bentley and Gibson made the shortlist of the Blue Waters UG Summer Program.

Hosting MSR’s PPCP: The PI hosted the first version of Microsoft Research’s Practical Paralleland Concurrent Programming http://research.microsoft.com/en-us/projects/ppcp/ course at Utahhttp://www.eng.utah.edu/~cs5955/, employing his UGs Bentley, Sawaya, and Mayo.

Center for Parallel Computing at Utah: The PI has helped put together (and now directs) the Center for Parallel Computing at Utah ("CPU"), approved by the University Senate in late 2010. The co-PI is an active member of the CPU center. They both collaborate with their colleagues active in HPC and reliability/parallel compilation, especially Profs. Robert M. Kirby and Mary Hall. They helped organize a Distinguished Lecture Series and other joint activities spanning the campus (http://www.parallel.utah.edu). The PI and Siegel (Delaware) gave an invited tutorial on HPC and FV at VMCAI 2011 [137].

Tool Dissemination: The tools developed in this project will be evaluated through the PI's and co-PI's collaborators. The Uintah software is being disseminated through the SCI Institute, as described at length in our Data Management Plan. In addition, we will distribute it through other channels, especially exploiting the PI's recent Utah Faculty Associate status in the DOE Institute for Sustained Performance, Energy and Resilience (SUPER). We will also evaluate our software through our industrial contacts, including Lecomber, with whom the PI is giving an SC'11 tutorial.


C.5 Selected Technical Details

Figure 4: Task Graph for the MPM ICE Component

Formal Operational Semantics: We will develop a formal operational semantic description for Uintah's task graph models. For user dataflow graphs (an example pertaining to the ICE component is shown in Figure 4, courtesy of [180]), the semantics will formally describe the abstract state model chosen and the evaluation ("small step") semantics. Our past experience writing different operational semantics for MPI [149, 93] will allow us to choose the right level of abstraction. For the remaining components, namely the scheduler, load balancer, and AMR modules, we will employ appropriate state models and small-step semantics. These formal descriptions will be in an executable notation such as [114] to permit debugging through finite-state model checking, and verifying putative queries submitted to the model. Higher-level notations such as Maude [115] will also be investigated. Another option to be considered is Live Sequence Charts (LSCs, [118]), described below.
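As an indication of the intended level of abstraction, the following C++ sketch (all names hypothetical; a stand-in for the eventual executable notation, not the actual semantics) models the abstract state as the set of variables present in the data warehouse, with one small-step rule that fires any task whose inputs are available:

    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    // Hypothetical task: fires once its inputs exist in the data warehouse.
    struct Task {
        std::string name;
        std::set<std::string> inputs;   // "requires" set
        std::set<std::string> outputs;  // "computes" set
        bool fired = false;
    };

    // Abstract state: the variables currently in the data warehouse.
    using State = std::set<std::string>;

    // Small-step rule: execute one enabled, unfired task, extending the state.
    // Returns false at a fixpoint (all tasks done, or a deadlock).
    bool step(State &s, std::vector<Task> &tasks) {
        for (auto &t : tasks) {
            if (t.fired) continue;
            bool enabled = true;
            for (const auto &in : t.inputs)
                if (!s.count(in)) { enabled = false; break; }
            if (enabled) {
                s.insert(t.outputs.begin(), t.outputs.end());
                t.fired = true;
                std::cout << "fired " << t.name << "\n";
                return true;
            }
        }
        return false;
    }

    int main() {
        State s = {"u_old"};
        std::vector<Task> graph = {
            {"interpolate", {"u_old"}, {"u_faces"}},
            {"advect", {"u_faces"}, {"u_new"}},
        };
        // A model checker would branch over every enabled task at each step;
        // here we simply run one interleaving to a fixpoint.
        while (step(s, graph)) {}
    }

Putative queries ("every fired task's inputs were computed first", "no unfired task remains at a fixpoint") then become assertions checkable over all interleavings.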

Aspect Oriented Programming: Our focus here will be to understand how to specify aspects that cut across dataflow descriptions. Once these aspects are specified, we will develop synthesis tools that take them and generate run-time monitoring code.

We will research techniques that can help us relate events from different dataflow graphs. Consider the examples mentioned in § C.2.6 once again. The specific questions will be to capture AMR-time interactions, load-balancing interactions, and similar cross-subsystem interactions. We will develop suitable cross-cutting constraints, themselves expressed as flow-graph snippets, that will help bind various interfaces into a cohesive whole. These will differ from the other flow-graphs in several ways:
• They will involve a subset of events described in the vocabulary of each individual dataflow graph.
• We will investigate the use of Live Sequence Charts, which can express hot (required) and cold (optional) behaviors elegantly.
We will exploit our past work on LSC-based verification and synthesis [116, 117] and generate monitors (in the form of C++ code) that observe/validate cross-cutting behaviors; a sketch of such a monitor appears below. Given the graphical nature of LSCs, we believe that Uintah developers will be encouraged to intuitively view and modify these descriptions.
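The following C++ fragment is a minimal sketch of the shape of monitor we would synthesize (the chart, the event names, and the obligation are all hypothetical): a hot obligation that a post-regrid redistribution must occur before the next timestep begins.

    #include <cstdio>

    // Hypothetical cross-cutting events observed via instrumentation hooks.
    enum class Event { RegridBegin, Redistribute, TimestepBegin };

    // Monitor synthesized from an LSC-like chart: after RegridBegin, the
    // hot event Redistribute MUST occur before the next TimestepBegin.
    class RegridMonitor {
        enum class St { Idle, AwaitRedistribute } st = St::Idle;
    public:
        void observe(Event e) {
            switch (st) {
            case St::Idle:
                if (e == Event::RegridBegin) st = St::AwaitRedistribute;
                break;
            case St::AwaitRedistribute:
                if (e == Event::Redistribute)
                    st = St::Idle;  // hot obligation met
                else if (e == Event::TimestepBegin)
                    std::fprintf(stderr, "LSC violation: timestep began "
                                         "before post-regrid redistribution\n");
                break;
            }
        }
    };

Cold (optional) chart segments would simply not raise a violation when skipped; only hot segments generate error reports.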

Differential Verification: Once a new Uintah component has been validated and its new interface specification obtained, we will re-verify the role of the new component in the overall context specified by our cross-cutting constraints. Thus our approach involves far fewer (if any) "push-button" checks, but rather a simple and effective methodology. The steps of the methodology are:
• Design the new component.
• Validate it against the test suite generated from the original interface specification. This validation will use a combination of concolic testing and dynamic verification. Tools such as SAGE [42] and symbolic fuzzing techniques will be employed to enrich test cases.
• Update the interface specification of the new component.
• Validate the new component together with the old components against the cross-cutting LSC behaviors (a sketch of this differential check follows the list).
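As a sketch of the validation steps (the interface and the equivalence test are hypothetical; in practice the comparison would be the spec-level equivalence relation, not raw output equality), the differential check replays one test suite against both component versions and compares only interface-visible behavior:

    #include <vector>

    // Hypothetical interface-level view of a scheduler component.
    struct SchedulerIface {
        virtual std::vector<int> order(const std::vector<int> &taskIds) = 0;
        virtual ~SchedulerIface() = default;
    };

    // Replay the suite derived from the old interface spec against both
    // versions; exact equality stands in for the spec-level equivalence.
    bool interfaceEquivalent(SchedulerIface &oldS, SchedulerIface &newS,
                             const std::vector<std::vector<int>> &tests) {
        for (const auto &t : tests)
            if (oldS.order(t) != newS.order(t))
                return false;  // divergence: re-check against LSC behaviors
        return true;
    }

Only components whose interface-visible behavior diverges need the heavier LSC-level re-verification, which is what keeps the methodology incremental.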

Performance Evaluation, Prediction, Symbolic Analysis: Symbolic analysis for performance has been well studied in [8]. Our analysis will not be for exact timings, but for overlooked parametric dependencies that may make the code inherently non-scalable. As mentioned in § C.3.4, a point of emphasis will be to check for greater-than-linear dependency of performance or resource allocation with respect to the number of processors, processes, and/or threads ("size parameters"). We propose two approaches (illustrated via examples):
• Through symbolic and static analysis techniques, we will detect whether the size parameters governing the code ever flow into loop iteration ranges and other programmer-chosen variables. We will further analyze the situation based on loop iteration counts and nesting, exploiting our work on handling loops [89].
• Almost all concolic verifiers (e.g., [39]) employ heuristics for test-case selection. We will investigate the generation of test suites during concolic testing that are capable of revealing code performance deficiencies ("performance calibration"), exploiting the test generation heuristics we have explored in [31].

This approach will allow our analysis tools to exploit a common base of concolic testing methods for both correctness and performance testing.
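The following C++ fragment (hypothetical; not Uintah code) shows the pattern the first approach targets: a size parameter flowing into nested loop ranges, giving per-process work quadratic in the number of ranks.

    #include <vector>

    // Stub predicate, present only to keep the sketch self-contained.
    static bool sharesBoundary(int a, int b) { return a + 1 == b || b + 1 == a; }

    // The size parameter nRanks flows into both loop ranges: O(nRanks^2)
    // work per process. Harmless on 1K cores; untenable on a million.
    void exchangeNeighborInfo(int nRanks, std::vector<int> &neighbors) {
        for (int i = 0; i < nRanks; ++i)
            for (int j = 0; j < nRanks; ++j)
                if (sharesBoundary(i, j))
                    neighbors.push_back(j);
    }

A flow analysis that tracks nRanks into the loop bounds flags this site; concolic performance calibration would then measure how sharply the cost actually grows with nRanks.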

Selected Formal Approaches to Concurrency and HPC: The simplest structured testing approach is one based on perturbing schedules by inserting random delays between program statements (e.g., [154]). While this can often be effective for thread programs (e.g., [46]), it is highly wasteful for message passing programs, where most actions are independent. Certainly, testing systems architected around components in this manner is bound to be non-scalable.

Verification tools for shared memory concurrency are under active research in many groups (e.g., [59, 106, 130, 44, 71, 69, 155]), including static analyzers, software model checkers, and runtime verifiers. While direct model checking of HPC software [136] can play an important role in analyzing new or tricky code, "stateless" dynamic verification [59, 55] has better prospects for scalability. Standard approaches in this regard include defining a notion of independent (commuting, [45]) operations, and considering all commutations of independent operations in an execution trace to be equivalent as per the happens-before relation [85]. This relation was defined for MPI for the first time in our past work. We defined it in terms of a matches-before order because, in message passing programs, the individual steps are message matching actions (for ease of reference, we continue to call this the "happens-before"). Our work shows that it is the ordering of message matches across processes that induces a matches-before ordering among the MPI calls issued by each process [151, 149, 143].
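A standard MPI fragment illustrates why match orderings, rather than statement orderings or inserted delays, are the right objects to enumerate (the snippet is illustrative and not from Uintah):

    #include <mpi.h>

    // Rank 0 posts two wildcard receives; ranks 1 and 2 each send once.
    // Either sender may match either receive: two distinct match orderings.
    void wildcardExample(int rank) {
        int buf = 0;
        if (rank == 1 || rank == 2) {
            MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else if (rank == 0) {
            MPI_Status st;
            MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            // A dynamic verifier enumerates both matches-before orderings;
            // random delays cannot reliably force the second one.
        }
    }

All statement interleavings that produce the same match ordering are equivalent, which is exactly the reduction that makes dynamic verification scale.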

The advantages of deterministic MPI program analysis are discussed in [137], where it is shown that buffering can be safely added to deterministic MPI programs (in [148] we show the kinds of analysis needed to detect deadlocks when determinism cannot be assumed). There is noticeable momentum in the community toward the use of deterministic constructs wherever possible (e.g., [80]).

Data flow analysis of purely MPI programs was pioneered in [135]. Bronevetsky, one of our collaborators at LLNL, is currently developing more advanced flow analysis methods for MPI [37].

Recent projects have also developed formal analysis methods for other shared memory models, such as CUDA [47] and parallel DMA engines based on the IBM Cell processor [50].

C.5.1 Feasibility Assessment

None of the surveyed works have considered the use of formal methods during design, and at a scale similar to what we are proposing for Uintah. Clearly this is new and needed work, but we must be sure that we are not proposing something unattainable. We provide some justifications for the feasibility of our proposal:
• Wherever we have a choice between a technique that works up to 1,000 lines of code and one that scales to a million or more lines of code, we have to choose the latter. Two examples:
  – As opposed to comprehensive symbolic performance analysis [8], we choose a much more scalable flow-based (static) analysis for parametric dependencies on size parameters, plus concolic-testing-based performance calibration.
  – As opposed to a full Symbolic Differencing approach [27], our work on differential verification will be guided by the deltas in the interface flow-graph specifications of components.

• Our past work covers many topics in concurrency, giving us both experience and a rich tool infrastructure: thread verification [158, 159], static analysis based on the LLNL Rose Framework [147], static and symbolic analysis based on the LLVM Framework [31], verification of hybrid Pthread/MPI programs [43], and formal methods for the Multicore Communications API [132].

• Our publication profile and problem-scale selection have consistently appealed to practitioner-rich communities, as indicated by our publications [54, 7, 9, 150, 151], where we show dynamic formal verification of unmodified MPI programs on real cluster machines. This experience will guide us in always choosing scalable and effective formal techniques.

C.5.2 Specific Formal Artifacts and Formal Analysis Tools to be Developed

The main activity in this project will be the development of analysis tools, the use of these tools to strengthen Uintah components, and the reintegration of those components. In this context, here are some of the actual tools and artifacts to be developed:

Year 1:
• Formal semantics for task graphs
• Formal description of Uintah schedulers
• Suitable syntax for writing operational semantics (Murphi rule-style, LSC-style, etc.)
• Formal aspect-oriented specification and a compilation algorithm into run-time monitors
• Concolic verifiers for data warehouse designs and schedulers

Year 2:
• Formal tools to compare component interface behaviors for safe interchangeability through assertion-based testing and monitoring
• Differential verification methods that track scheduler and data warehouse design changes
• Tool releases of differential verification tools

Years 3–4:
• Runtime monitor generators
• Concolic testing to generate parametric tests in the performance space
• Tool evaluation on actual designs
• Tool adaptation to new CPU, GPU/accelerator, and API types
• Tool hardening and tool releases
Our choice of techniques will be governed by practical applicability (scalability). A considerable amount of effort will go into supporting nuances specific to each CPU/GPU/accelerator and API, so that users have the incentive to actually use our tools, not merely view them as curiosities.

C.6 Results from Prior NSF Support

Ganesh Gopalakrishnan

CNS-0509379: CSR-SMA: Toward Reliable and Efficient Message Passing Through Formal Analysis; $400,000, 8-1-05 to 7-31-10 (1-yr extension). PI: Gopalakrishnan, co-PI: Robert M. Kirby. Under this award, the following students finished their degrees under the PI:
• PhD: Palmer (07), Yang (09), Vakkalanka (10), Vo (11)
• MS: Pervez (07)
• BS: Sawaya (10), DeLisi (09)


Some publications from this award: [146, 145, 134, 144, 152, 149, 32, 153, 121, 110, 111, 93, 159].

Software including Demos, Releases: The following items of software were released (all pertaining to transitioning formal dynamic verification tools into practice):
• ISP [77], released under the Berkeley license, with over 300 worldwide downloads. ISP runs under Windows, Linux, and Mac OS X, and has been tested against five MPI libraries (MPICH2, OpenMPI, Microsoft MPI, IBM's MPI, and MVAPICH).
• GEM [56] is now an official part of the Eclipse Parallel Tools Platform, release version 4.0.
• Inspect [76] has been released, and finds extensive use and further development at NEC Research, Princeton.

Tutorials: In addition to what is mentioned elsewhere, we contributed half-day tutorials on ISP and Inspect at ICS (May '09, [62]), FM09 (Nov '09, [61]), FM08 (May '08, [65]), and PPoPP (Jan '10, [63]), and a full-day tutorial on ISP at EuroPVM/MPI (Sep '09, [54]).

CCF-0811429: CPA-DA: Formal Methods for Multi-core Shared Memory Protocol Design. PI: Gopalakrishnan. $250,000, 8-1-08 to 7-31-11.

Students: PhD: Li (10). BS: Humphrey, Derrick, Williams. Major publications resulting from this award are [93, 92, 88, 147, 66, 32, 73, 153]. Paper accepted at FSE, 2010 [91].

Software including Demos: Released PUG under the Berkeley license, accompanied by thorough online documentation at [157].

CCF 0903408: NSF/SRC: MCDA: Formal Analysis of Multicore Communication APIs and Applications. PI: Gopalakrishnan. $249,556, 8-1-09 to 7-31-12. Students: PhD student Sharma is expected to finish in 2011. MS students Meakin (10), Chiang (10), and Jones (09) finished.

Major publications resulting from this award are [131, 102, 101, 66, 132, 133, 134]. The XUM multicore architecture [156] and its VHDL design have been released, along with the GCC tool chain and support information.

Tutorials: Half-day tutorial on ISP at Supercomputing 2010 [7], and full-day tutorial on ISP and DAMPI [150, 151] at Supercomputing 2011 [9].

CNS 1035658: CPS: Medium: Safety-Oriented Hybrid Verification for Medical Robotics. PI: Might, co-PIs: Hollerbach, Gopalakrishnan, Parker. $500,000, 10-1-10 to 9-30-12 (awarded).

PhD student Peng Li (estimated graduation 2012) is being supported. Peng has created a concolic verifier for GPU kernel programs; this is under submission at Principles and Practices of Parallel Programming (PPoPP), 2012 (Reference [31]).

Martin Berzins: Dr. Berzins has been in the US since 2003. From 2003 to 2009 he was funded as part of the DOE ASC C-SAFE project at a level of about $200K per year. While transitioning to Utah, Berzins continued his UK EPSRC grants until October 2006; see http://gow.epsrc.ac.uk/ViewPerson.aspx?PersonId=1303.

More recently, he is PI of Cyberinfrastructure awards:
OCI-0721659 SDCI: HPC Improvement and Release of the Uintah Computational Framework; $703,916, 12/1/2007 to 11/30/2011 (1-yr extension). PI: Berzins. Students:
• PhD: Justin Luitjens (2010)

Software Releases: The Uintah software has been formally released under this program; see http://www.uintah.utah.edu.

OCI 0905068 PetaApps: Petascale Simulation of Sympathetic Explosions. PI: Berzins, co-PIs: Wight and Harman. $999,280, 9/1/2009 to 9/30/2013.

Publications: Both projects have led to the publications [181, 182, 183, 184, 186, 187, 165, 166].


C.7 Collaboration Plans

The collaboration aspect of this proposal involves two quite different research groups within the University of Utah, in two distinct academic units (the School of Computing and the Scientific Computing and Imaging Institute), and so deserves clarification.

C.7.1 Roles for Project Participants

Specific deliverables for the proposed work are now identified, with collaborative plans for their completion.

PI Gopalakrishnan, in the School of Computing, has direct expertise in both formal methods and their application to parallel computing; this defines his area of responsibility as leading the formal methods-based modification of the Uintah software. This division of responsibility reflects the PI's long-standing formal methods approaches to parallel software, such as MPI message passing software.

Co-PI Berzins, in the Scientific Computing and Imaging Institute and the School of Computing, has led the effort to improve the scalability of the Uintah software over the last five years or so. This has involved the redesign of many internal components and algorithms of the code. His concern will thus be to ensure that the formal approaches developed here have immediate practical application, both to the software and to how it will be used on practical large scale problems. Berzins will also ensure that external parallel computing resources are available for large scale experimentation. At present co-PI Berzins has 10M SUs in 2011 on the NSF Teragrid (now XSEDE) machines such as Kraken and Ranger, and 15M processor hours on the DOE Jaguar supercomputer. Additional large scale GPU and Intel MIC resources will be sought as part of this process. The University of Utah houses a CUDA Center of Excellence physically situated in the SCI Institute, and so has access to small and medium scale GPU resources.

The PI and co-PI are both academic members of the School of Computing, and their offices are located immediately adjacent to each other, making collaboration easier than would otherwise be the case.
Outreach: The PI and co-PI will create opportunities to promote the research outcomes of the proposal by organizing tutorials and workshops in conjunction with international conferences and conventions. PI Gopalakrishnan is responsible for educating the community in regard to formal methods applied to parallel computing.

Co-PI Berzins is responsible for advocating scalable parallel computing and will increasingly advocate for the formal approach in the design of next-generation software leading to exascale.

The PIs intend to target this combination of formalism and large scale parallelism as the topic of workshops at parallel computing conferences such as SIAM Parallel Processing for Scientific Computation, IPDPS, and Supercomputing, to name but three diverse examples of many possible instances.
Education: Under the direction of both PI Gopalakrishnan and co-PI Berzins, there will be a concerted effort to introduce formal approaches into the parallel computing graduate class that Berzins teaches, and to introduce issues related to large scale parallelism into the formal methods classes that Gopalakrishnan teaches.

C.7.2 Project Management

PI Gopalakrishnan has the lead managerial role and project direction. He will be responsible for coordinating reporting, and for ensuring that regular meetings take place to meet project outcomes. The main present mechanism for the day-to-day management of the Uintah software is a meeting called "Homebrew" that has taken place on Tuesday mornings at 10am for a decade. This meeting is used to explore and provide solutions to problems of common interest to those using the software. Remote national and international participants connect to Homebrew via video conferencing. There are also smaller meetings associated with individual projects, and one such weekly meeting will take place for this project. Both PI Gopalakrishnan and co-PI Berzins will be responsible for system integration, to ensure that the addition of the formal approaches yields a whole that is greater than the sum of the two (formal and Uintah) parts. In part this will come about through each area influencing the other: formal methods must lead to a better Uintah design, and the sheer scale of the parallelism aimed at here will influence the broad applicability of the formal approaches. The broad range of testing available with Uintah, through nightly regression testing and very broad applications testing, will provide additional insight into how well the formal approaches work.

Both the PI and co-PI will be on the committees of all students in the project and will work closely on a day-to-day basis.

C.7.3 Coordination Mechanisms for Scientific Integration

Yearly Workshops: The PI and co-PI will organize workshops for students and researchers at leading parallel computing conferences.
Software and Data Management: Co-PI Berzins will host software repositories, wikis, mailing lists, discussion forums, and other IT infrastructure to coordinate research activities amongst the group. These will be based on existing Uintah infrastructure and will include regular releases of the software.

Technical reports, white papers, etc., will be published immediately through the SCI Institute, which has an extensive publications database at http://www.sci.utah.edu/scipubs.html, and subsequently through journals and conference proceedings.

C.7.4 Budget Support for Collaboration and Coordination

The budget for the PI and co-PI includes resources for student and PI/co-PI travel to support efforts to promote research results at well-known research conferences and conventions.


D References Cited


[1] Karissa Y. Sanbonmatsu. FSE 2010 Keynote Address. http://www.t6.lanl.gov/kys/

[2] Supercomputer TOP500 Rankings Released: Japan's K Computer Receives A. http://www.utk.edu/tntoday/2011/06/20/supercomputer-top500-rankings-released-japans-computer-receives/

[3] Lawrence Livermore Prepares for 20 Petaflop Blue Gene/Q. http://www.hpcwire.com/hpcwire/2009-02-03/lawrence_livermore_prepares_for_20_petaflop_blue_gene_q.html

[4] Exascale computing study report. http://users.ece.gatech.edu/~mrichard/ExascaleComputingStudyReports/ECS_reports.htm

[5] Intel Many Integrated Core Architecture. http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html

[6] Thomas Cheatham, Amr Fahmy, Dan C. Stefanescu, and Leslie G. Valiant. Bulk Synchronous Parallel Computing – A Paradigm for Transportable Software. Hawaii Intl. Conf. on System Sciences, 1995.

[7] Ganesh Gopalakrishnan, Matthias Muller, and Bronis de Supinski. Scalable Dynamic Formal Verification and Correctness Checking of MPI Applications. Half-day Tutorial at Supercomputing 2010.

[8] Sumit Gulwani. The SPEED tool, and Symbolic Space/Time Bound Analysis. http://research.microsoft.com/en-us/um/people/sumitg/pubs/speed.html

[9] Matthias Muller, Bronis de Supinski, Ganesh Gopalakrishnan, Tobias Hilbrich, and David Lecomber. Dealing with MPI Bugs at Scale: Best Practices, Automatic Detection, Debugging, and Formal Verification. Full-day Tutorial at Supercomputing, November 2011.

[10] “Total View Concurrency Tool,” http://www.totalviewtech.com.

[11] Tobias Hilbrich, Martin Schulz, Bronis R. de Supinski, and Matthias S. Muller. MUST: A Scalable Approach to Runtime Error Detection in MPI Programs. Chapter 5 in M.S. Muller et al. (eds.), Tools for High Performance Computing 2009, p. 53. DOI 10.1007/978-3-642-11261-4_5.

[12] Suvas Vajracharya, Steve Karmesin, Peter Beckman, James Crotinger, Allen Malony, Sameer Shende, Rod Oldehoeft, and Stephen Smith. SMARTS: Exploiting Temporal Locality and Parallelism through Vertical Execution. http://uuu.enseirb.fr/~pelegrin/enseignement/enseirb/archsys/documents/smarts_paper.ps.gz

[13] S. Amarasinghe, D. Campbell, W. Carlson, A. Chien, W. Dally, E. Elnohazy, M. Hall, R. Harrison, W. Harrod, K. Hill, J. Hiller, S. Karp, C. Koelbel, D. Koester, P. Kogge, J. Levesque, D. Reed, V. Sarkar, R. Schreiber, M. Richards, A. Scarpelli, J. Shalf, A. Snavely, and T. Sterling. Exascale computing study: Technology challenges in achieving exascale systems. Technical Report ECSS Report 101909, Georgia Institute of Technology, 2009.

[14] Goodale, T., Allen, G., Lanfermann, G., Mass, J., Seidel, E., and Shalf, J. The Cactus Framework and Toolkit: Design and applications. In Vector and Parallel Processing – VECPAR 2002, 5th International Conference (2003), vol. 2565, Springer, pp. 1536.

[15] Feo, J., Cann, D., and Oldehoeft, R. A report on the SISAL language project. Journal of Parallel and Distributed Computing 10 (1990), 349–366.

[16] Kale, L. V., and Krishnan, S. Charm++: Parallel programming with message-driven objects. In Parallel Programming using C++, G. V. Wilson and P. Lu, Eds. MIT Press, Cambridge, MA, USA, 1996, pp. 175–213.


[17] Buttari, A., Langou, J., Kurzak, J., and Dongarra, J. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Computing 35, 1 (2009), 38–53.

[18] Dinan, J., Krishnamoorthy, S., Brian, L., Nieplocha, J., and Sadayappan, P. Scioto: A framework for global-view task parallelism. In ICPP '08: 37th International Conference on Parallel Processing (Washington DC, USA, 2008), IEEE Computer Society, pp. 586–593.

[19] Concurrent Collections. http://software.intel.com/en-us/articles/intel-concurrent-collections-for-cc/

[20] Allinea DDT. http://www.allinea.com/ddt

[21] Jaguar Machine. http://www.olcf.ornl.gov/tag/jaguar/

[22] Aspect Oriented Programming. http://msdn.microsoft.com/en-us/library/aa288717(v=vs.71).aspx

[23] MPI 3.0 Standardization Effort. http://meetings.mpi-forum.org/MPI_3.0_main_page.php

[24] Rajeev Thakur and William Gropp. Test Suite for Evaluating Performance of MPI Implementations That Support MPI_THREAD_MULTIPLE. In Proc. of the 14th European PVM/MPI Users' Group Meeting (Euro PVM/MPI 2007), September 2007, pp. 46–55. (Outstanding Paper.)

[25] Deian Tabakov and Moshe Y. Vardi. Optimized temporal monitors for SystemC. In Proceedings of the First International Conference on Runtime Verification (RV'10), Springer-Verlag, Berlin. ISBN 3-642-16611-3, 978-3-642-16611-2.

[26] Clem Cole and Russell Williams. Photoshop Scalability: Keeping It Simple. CACM, Volume 53, Number 10, October 2010. http://queue.acm.org/detail.cfm?id=1858330

[27] SymDiff: Static semantic diff. http://research.microsoft.com/en-us/projects/symdiff/

[28] Max Grossman, Alina Simion, Zoran Budimlic, and Vivek Sarkar. CnC-CUDA: Declarative Programming for GPUs. 2010 Workshop on Languages and Compilers for Parallel Computing (LCPC), October 2010.

[29] R. Alur. Timed Automata. NATO-ASI 1998 Summer School on Verification of Digital and Hybrid Systems. 11th International Conference on Computer-Aided Verification, LNCS 1633, pp. 8–22, Springer-Verlag, 1999.

[30] Yasmina Abdeddaim, Abdelkarim Kerbaa, and Oded Maler. Task Graph Scheduling using Timed Automata. Proc. FMPPTA'03 (2003).

[31] Li, P., Li, G., Gopalakrishnan, G., Sawaya, G., Rajan, S.P., and Ghosh, I. GKLEE: Concolic Verification and Test Generation for GPUs. Submitted to Principles and Practices of Parallel Programming (PPoPP), 2012.

[32] Aananthakrishnan, S., DeLisi, M., Vakkalanka, S. S., Vo, A., Gopalakrishnan, G., Kirby, R. M., and Thakur, R. How formal dynamic verification tools facilitate novel concurrency visualizations. In EuroPVM/MPI (2009), pp. 261–270.

[33] Altekar, G., and Stoica, I. ODR: Output-deterministic replay for multicore debugging. In SOSP '09: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (New York, NY, USA, 2009), ACM, pp. 193–206.

[34] P2S2 2010 Panel Discussion, "Is Hybrid Programming a Bad Idea Whose Time Has Come?" http://www.mcs.anl.gov/events/2010-09-13-p2s2-panel-pavan-balaji.pdf

[35] Borkar, S., and Chien, A. The future of microprocessors. Communications of the ACM (May 2011).


[36] The Future of Computing Performance: Game Over or Next Level? http://books.nap.edu/openbook.php?record_id=12980&page=R1

[37] Bronevetsky, G. Communication-sensitive static dataflow for parallel message passing applications. In CGO (2009), pp. 1–12.

[38] Butenhof, D. R. Programming with POSIX Threads. Addison-Wesley, 2006.

[39] Cadar, C., Dunbar, D., and Engler, D. R. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In OSDI (2008), pp. 209–224.

[40] KLEE Open Projects. http://klee.llvm.org/OpenProjects.html

[41] Cadar, C., Ganesh, V., Pawlowski, P. M., Dill, D. L., and Engler, D. R. EXE: Automatically generating inputs of death. In Proceedings of the 13th ACM Conference on Computer and Communications Security (CCS) (2006). http://www.stanford.edu/~cristic/papers/exe-tissec.

[42] Cadar, C., Godefroid, P., Khurshid, S., Pasareanu, C. S., Sen, K., Tillmann, N., and Visser, W. Symbolic execution for software testing in practice: preliminary assessment. In ICSE (2011), pp. 1066–1071.

[43] Chiang, W.-F., Szubzda, G., Gopalakrishnan, G., and Thakur, R. Dynamic verification of hybrid programs. In EuroMPI (2010). To appear as a poster paper.

[44] Chipounov, V., Kuznetsov, V., and Candea, G. S2E: a platform for in-vivo multi-path analysis of software systems. In ASPLOS (2011), pp. 265–278.

[45] Clarke, E. M., Grumberg, O., and Peled, D. A. Model Checking. MIT Press, 2000.

[46] Copty, S., and Ur, S. Toward automatic concurrent debugging via minimal program mutant generation with AspectJ. Electr. Notes Theor. Comput. Sci. 174, 9 (2007), 151–165.

[47] Compute Unified Device Architecture (CUDA). http://www.nvidia.com/object/cuda_get.html.

[48] The cudaMPI Library. http://www.cs.uaf.edu/sw/cudaMPI/.

[49] http://www.nersc.gov/nusers/systems/dirac/programming.php. The NERSC Dirac Cluster.

[50] Donaldson, A., Kroening, D., and Rummer, P. Automatic analysis of DMA races using model checking and k-induction. Formal Methods in System Design (FMSD) 39, 1 (2011), 83–113.

[51] http://www.hpcx.ac.uk/support/FAQ/eager.html.

[52] Engler, D. R., and Ashcraft, K. RacerX: effective, static detection of race conditions and deadlocks. In SOSP (2003), pp. 237–252.

[53] Scalapack user’s guide. J. J. Dongarra, Ed., SIAM. ISBN:0-89871-397-8.

[54] Invited full-day tutorial: Practical formal verification of MPI and thread programs, Nov. 2009. EuroPVM/MPI, Espoo.

[55] Flanagan, C., and Godefroid, P. Dynamic partial-order reduction for model checking software. In POPL '05: Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (2005), pp. 110–121.

[56] Release page for GEM: Graphical Explorer for MPI Programs. http://www.cs.utah.edu/fv/ISP-eclipse.

[57] Germain, J. D. d. S., McCorquodale, J., Parker, S. G., and Johnson, C. R. Uintah: A massively parallel problem solving environment. In HPDC'00: Ninth IEEE International Symposium on High Performance and Distributed Computing (Aug. 2000).


[58] Gil, Y., Gonzalez-Calero, P. A., and Deelman, E. On the black art of designing computational workflows. In Proceedings of the Second Workshop on Workflows in Support of Large-Scale Science (WORKS'07) (2007).

[59] Godefroid, P. Model checking for programming languages using VeriSoft. In Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '97) (New York, NY, USA, 1997), ACM, pp. 174–186.

[60] Godefroid, P., Klarlund, N., and Sen, K. DART: directed automated random testing. In PLDI (2005), pp. 213–223.

[61] Gopalakrishnan, G. Tutorial 03: Practical MPI and Pthread dynamic verification. FM 2009, Eindhoven, Nov. 2009.

[62] Gopalakrishnan, G. Tutorial T8: Practical formal verification of MPI and thread programs. ICS, Yorktown Heights, June 2009.

[63] Gopalakrishnan, G. Tutorial: Dynamic verification of message passing and threading. PPoPP, Bangalore, Jan. 2010.

[64] Gopalakrishnan, G., Kirby, R. M., Siegel, S., Thakur, R., Gropp, W., Lusk, E., de Supinski, B. R., Schulz, M., and Bronevetsky, G. Formal analysis of MPI-based parallel programs: Present and future. To appear in Communications of the ACM, December 2011.

[65] Gopalakrishnan, G., and Yang, Y. Tutorial: Runtime Model Checking of Multithreaded C Programs using Automated Instrumentation, Dynamic Partial Order Reduction, and Distributed Checking (half day) (T6), Turku, May 2008.

[66] Gopalakrishnan, G., Yang, Y., Vakkalanka, S., Vo, A., Aananthakrishnan, S., Szubzda, G., Sawaya, G., Williams, J., Sharma, S., DeLisi, M., and Atzeni, S. Some resources for teaching concurrency. In Proceedings of the 7th Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging (2009).

[67] Gropp, W., Lusk, E., and Skjellum, A. Using MPI: portable parallel programming with the Message-Passing Interface. MIT Press, 1999.

[68] Gupta, S. Building Exascale GPU-Based Supercomputers. In Supercomputing (2010). Session: Heterogeneous Computing II.

[69] Havelund, K., and Pressburger, T. Model checking Java programs using Java PathFinder. Intl. J. on Software Tools for Technology Transfer 2, 4 (Apr. 2000).

[70] Junfeng Yang, Tisheng Chen, Ming Wu, Zhilei Xu, Xuezheng Liu, Haoxiang Lin, Mao Yang, Fan Long, Lintao Zhang, and Lidong Zhou. MODIST: Transparent Model Checking of Unmodified Distributed Systems. In Proceedings of the Sixth Symposium on Networked Systems Design and Implementation (NSDI '09), April 2009.

[71] Holzmann, G. J. The Spin Model Checker. Addison-Wesley, Boston, 2004.

[72] Hosabettu, R., Gopalakrishnan, G., and Srivas, M. Formal verification of a complex pipelined processor. Formal Methods in System Design 23, 2 (Sept. 2003), 171–213.

[73] Humphrey, A., Derrick, C., Gopalakrishnan, G., and Tibbitts, B. R. GEM: Graphical explorer for MPI programs. In Parallel Software Tools and Tool Infrastructures (ICPP workshop) (2010). http://www.cs.utah.edu/fv/GEM.

[74] http://www.cs.unb.ca/acrl/training/general/ibm_parallel_programming/pgm3.PDF.

[75] International exascale software project. http://www.exascale.org/iesp/Main_Page.


[76] Release information for inspect. http://www.cs.utah.edu/fv.

[77] Release page for isp. http://www.cs.utah.edu/fv/ISP-release.

[78] Jacobsen, D. A., Thibault, J. C., and Senocak, I. An MPI-CUDA Implementation for Massively Parallel Incompressible Flow Computations on Multi-GPU Clusters. In 48th AIAA Aerospace Sciences Meeting and Exhibit, Orlando, FL (Jan. 2010).

[79] K. Sen, D. Marinov, and G. Agha. "CUTE: a concolic unit testing engine for C." In 10th ESEC/FSE, 2005.

[80] Bocchino, Jr., R. L., Heumann, S., Honarmand, N., Adve, S. V., Adve, V. S., Welc, A., and Shpeisman, T. Safe nondeterminism in a deterministic-by-default parallel language. In POPL (2011), pp. 535–548.

[81] Karmani, R., and Parthasarathy, M. A contract language for race-freedom, 2010. Workshop on Exploiting Concurrency Efficiently and Correctly (EC2).

[82] Karunadasa, N., and Ranasinghe, D. Accelerating High Performance Applications with CUDA and MPI. In Industrial and Information Systems (ICIIS) (2009), pp. 331–336. Print ISBN 978-1-4244-4836-4.

[83] Kirk, D. B., and Hwu, W.-m. W. Programming Massively Parallel Processors. Morgan Kaufmann, 2010.

[84] Kumar, V. S., Rutt, B., Kurc, T., Catalyurek, U., Chow, S., Lamont, S., and Saltz, J. Large image correction and warping in a cluster environment. In Proceedings of SC'06 (Nov. 2006).

[85] Lamport, L. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (1978), 558–565.

[86] Lastovetsky, A., Kechadi, T., and Dongarra, J., Eds. Recent Advances in Parallel Virtual Machine and Message Passing Interface, 15th European PVM/MPI User's Group Meeting, Proceedings (2008), vol. 5205 of LNCS, Springer.

[87] Zvonimir Rakamaric. STORM: Static Unit Checking of Concurrent Programs. ICSE 2010.

[88] Li, G., DeLisi, M., Gopalakrishnan, G., and Kirby, R. M. Formal specification of the MPI-2.0 standard in TLA+. In Principles and Practices of Parallel Programming (PPoPP) (2008), pp. 283–284.

[89] Li, G., and Gopalakrishnan, G. Scalable SMT-based verification of GPU kernel functions. In Foundations of Software Engineering (2010). Accepted. http://www.cs.utah.edu/fv/PUG.

[90] The Satisfiability Modulo Theories Competition. www.smtcomp.org

[91] Li, G., and Gopalakrishnan, G. Scalable SMT-based verification of GPU kernel functions. In Foundations of Software Engineering (2010). Accepted.

[92] Li, G., Gopalakrishnan, G., Kirby, R. M., and Quinlan, D. PUG: A symbolic verifier of GPU programs. http://www.cs.utah.edu/formal_verification/pdf/PUG-Paper-PPoPP-Workshop.pdf

[93] Li, G., Palmer, R., DeLisi, M., Gopalakrishnan, G., and Kirby, R. M. Formal specification of MPI 2.0: Case study in specifying a practical concurrent programming API. Science of Computer Programming (2009).

[94] Zoran Budimlic, Michael Burke, Kathleen Knobe, Ryan Newton, David Peixotto, Vivek Sarkar, and Edwin Westbrook. Deterministic Reductions in an Asynchronous Parallel Language. The 2nd Workshop on Determinism and Correctness in Parallel Programming (WoDet), March 2011.


[95] Leslie Lamport. The Win32 Threads API Specification (1996). http://research.microsoft.com/users/lamport/tla/threads/threads.html

[96] The NCSA Lincoln Cluster at UIUC. http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/Intel64TeslaCluster/.

[97] Lusk, R., Pieper, S., Butler, R., and Chan, A. The asynchronous dynamic load-balancing library. http://unedf.org/contents/talks/Lusk-ADLB.pdf. Accessed 12/16/09. PowerPoint presentation from the Mathematics and Computer Science Division and the Nuclear Physics Division of Argonne.

[98] Lv, J., Li, G., Humphrey, A., and Gopalakrishnan, G. Performance degradation analysis of GPU kernels, 2011. EC2 Workshop: http://www.cs.utah.edu/cav2011.

[99] Manohar, R., and Martin, A. J. Slack elasticity in concurrent computing. In Proceedings of the Fourth International Conference on the Mathematics of Program Construction (1998), Springer-Verlag, pp. 272–285. LNCS 1422.

[100] Mazurkiewicz, A. Basic notions of trace theory. In Linear Time, Branching Time, and Partial Order in Logics and Models for Concurrency (1989), vol. 354, Springer Verlag, Lecture Notes in Computer Science.

[101] Meakin, B. Multicore System Design with XUM: the Extensible Utah Multicore Project. MS thesis, University of Utah, School of Computing, April 2010.

[102] Meakin, B., and Gopalakrishnan, G. Hardware design, synthesis, and verification of a multicore communication API. In SRC Techcon, Austin, TX (Sept. 2009).

[103] Mellor-Crummey, J. On-the-fly Detection of Data Races for Programs with Nested Fork-Join. In Supercomputing (1991), pp. 24–33.

[104] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, version 2.2, September 4, 2009.

[105] mpiblast: Open-source parallel blast. http://www.mpiblast.org Accessed 12/16/09.

[106] Musuvathi, M., Qadeer, S., and Ball, T. CHESS: Systematic stress testing of concurrent software. In Logic-Based Program Synthesis and Transformation (2007), vol. 4407/2007.

[107] Fermi. http://www.nvidia.com/object/fermiarchitecture.html.

[108] OpenCL: Open Computing Language. http://www.khronos.org/opencl.

[109] Pacheco, P. Parallel Programming with MPI. Morgan Kaufmann, 1996. ISBN 1-55860-339-5.

[110] Palmer, R., DeLisi, M., Gopalakrishnan, G., and Kirby, R. M. An approach to formalization and analysis of message passing libraries. In Formal Methods for Industry Critical Systems (FMICS 2007) (2008), S. Leue and P. Merino, Eds., pp. 164–181. LNCS 4916, EASST Best Paper Award.

[111] Palmer, R., Gopalakrishnan, G., and Kirby, R. M. Semantics driven dynamic partial-order reduction of MPI-based parallel programs. In Parallel and Distributed Systems: Testing and Debugging (PADTAD-V) (2007), pp. 43–53. Best Paper Award.

[112] Park, J., Diniz, P., and Raghunathan, S. Area and performance modeling of complete FPGA designs in the presence of loop transformations. IEEE Transactions on Computers, Special Issue on Field Programmable Logic (2004).

[113] ParMETIS - Parallel graph partitioning and fill-reducing matrix ordering. http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview.


[114] The Murphi Explicit State Model Checker. http://sprout.stanford.edu/dill/murphi.html. Utah distributions at http://www.cs.utah.edu/fv/Murphi. Utah's Parallel Murphi at http://www.cs.utah.edu/formal_verification/mediawiki/index.php/Eddy_Murphi.

[115] The Maude System. http://maude.cs.uiuc.edu/

[116] Annette Bunker and Ganesh Gopalakrishnan. Using Live Sequence Charts for Hardware Protocol Specification and Compliance Verification. IEEE International High Level Design Validation and Test Workshop (HLDVT) 2001, pp. 95–100. http://www.cs.utah.edu/~abunker/pubs/hldvt01.ps

[117] Annette Bunker, Ganesh Gopalakrishnan, and Sally A. McKee. Formal Hardware Specification Languages for Protocol Compliance Verification. ACM Transactions on Design Automation of Electronic Systems, vol. 9, no. 1, January 2004.

[118] Werner Damm and David Harel. LSCs: Breathing Life into Message Sequence Charts. Formal Methods in System Design, 2001.

[119] Pennycook, S., Hammond, S., Jarvis, S., and Mudalige, G. Performance Analysis of a Hybrid MPI/CUDA Implementation of the NAS-LU Benchmark. In Supercomputing (2010).

[120] Person, S., Dwyer, M. B., Elbaum, S., and Pasareanu, C. S. Differential symbolic execution. In SIGSOFT '08/FSE-16: Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (2008). ISBN 978-1-59593-995-1.

[121] Pervez, S., Gopalakrishnan, G., Kirby, R., Thakur, R., and Gropp, W. Formal methods applied to high-performance computing software design: a case study of MPI one-sided communication-based locking. Software: Practice and Experience (2009). Accepted.

[122] The Eclipse Parallel Tools Platform. http://www.eclipse.org/ptp.

[123] Qu, D., Roychoudhury, A., Lang, Z., and Vaswani, K. Darwin: An Approach for Debugging Evolving Programs. In Proceedings of the Symposium on Foundations of Software Engineering (ESEC/FSE) (Sept. 2009).

[124] Rivers, J. A., Gupta, M. S., Shin, J., Kudva, P., and Bose, P. Error tolerance in server class processors. IEEE Trans. on CAD of Integrated Circuits and Systems 30, 7 (2011), 945–959.

[125] ROSE web page. http://www.rosecompiler.org/, 2008.

[126] Sachdev, G. S., Sudan, K., Hall, M. W., and Balasubramonian, R. Data locality optimization of Pthread applications for non-uniform cache architectures. PACT 2011 poster.

[127] Santhanaraman, G., Balaji, P., Gopalakrishnan, K., Thakur, R., Gropp, W., and Panda, D. K. Natively supporting true one-sided communication in MPI on multi-core systems with InfiniBand. In Proceedings of the IEEE International Symposium on Cluster Computing and the Grid (CCGrid) (2009).

[128] Sarkar, V., Harrod, W., and Snavely, A. Software challenges in extreme scale systems. SciDAC Review, Special Issue on Advanced Computing: The Roadmap to Exascale (Jan. 2010), 60–65.

[129] Schulz, M., and de Supinski, B. R. PnMPI tools: a whole lot greater than the sum of their parts. In SC (2007), p. 30.

[130] Sen, K. Race directed random testing of concurrent programs. In PLDI (2008), pp. 11–21.

[131] Sharma, S., and Gopalakrishnan, G. Formal verification of MCAPI applications using the dynamic verification tool MCC. In SRC Techcon, Austin, TX (Sept. 2009). Chosen as the Best Verification Session Paper.


[132] Sharma, S., Gopalakrishnan, G., and Mercer, E. Dynamic verification of multicore communication applications in MCAPI. In International High-Level Design, Validation and Test Workshop (HLDVT) (Nov. 2009), IEEE.

[133] Sharma, S., Gopalakrishnan, G., Mercer, E., and Holt, J. MCC – a runtime verification tool for MCAPI user applications. In 9th International Conference on Formal Methods in Computer Aided Design (FMCAD) (Nov. 2009), IEEE, pp. 41–44.

[134] Sharma, S., Vakkalanka, S., Gopalakrishnan, G., Kirby, R. M., Thakur, R., and Gropp, W. A formal approach to detect functionally irrelevant barriers in MPI programs. In Lastovetsky et al. [86].

[135] Michelle Strout and Paul Hovland. Representation-independent program analysis. PASTE 2005, pp. 67–74.

[136] Siegel, S. F., and Avrunin, G. S. Verification of halting properties for MPI programs using non-blocking operations. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, 14th European PVM/MPI User's Group Meeting, Paris, France, September 30 – October 3, 2007, Proceedings (2007), F. Cappello, T. Herault, and J. Dongarra, Eds., vol. 4757 of Lecture Notes in Computer Science, Springer, pp. 326–334.

[137] Siegel, S. F., and Gopalakrishnan, G. Formal Analysis of Message Passing. In Verification, Model Checking, and Abstract Interpretation: 12th International Conference, VMCAI 2011 (2011), R. Jhala and D. Schmidt, Eds., LNCS. Invited Tutorial.

[138] OpenMP. http://www.openmp.org

[139] Sun, L., Stoller, C., and Newhall, T. Hybrid MPI and GPU Approach to Efficiently Solving Large kNN Problems. In Teragrid (2010). Poster.

[140] The Tess Cluster at the Nvidia CUDA Center of Excellence, School of Computing, University of Utah. http://www.sci.utah.edu/nvidia-coe.html.

[141] Thakur, R., Lai, P., Balaji, P., and Panda, D. K. ProOnE: A general purpose protocol onload engine for multi- and many-core architectures. In Special edition of the Springer Journal of Computer Science on Research and Development (2009).

[142] The Tianhe-1A and its use of MPI and CUDA. http://www.spacedaily.com/reports/New_Research_Provides_Effective_Battle_Planning_For_Supercomputer_War_999.html

[143] Vakkalanka, S. Efficient dynamic verification algorithms for MPI applications. Ph.D. Dissertation, 2010. Available from http://www.cs.utah.edu/fv.

[144] Vakkalanka, S., DeLisi, M., Gopalakrishnan, G., and Kirby, R. M. Scheduling considerations for building dynamic verification tools for MPI. In Parallel and Distributed Systems – Testing and Debugging (PADTAD-VI) (Seattle, WA, July 2008).

[145] Vakkalanka, S., DeLisi, M., Gopalakrishnan, G., Kirby, R. M., Thakur, R., and Gropp, W. Implementing efficient dynamic formal verification methods for MPI programs. In Lastovetsky et al. [86].

[146] Vakkalanka, S., Gopalakrishnan, G., and Kirby, R. M. Dynamic verification of MPI programs with reductions in presence of split operations and relaxed orderings. In CAV (2008), A. Gupta and S. Malik, Eds., vol. 5123 of Lecture Notes in Computer Science, Springer. Benchmarks are at http://www.cs.utah.edu/formal_verification/ISP_Tests.

[147] Vakkalanka, S., Szubzda, G., Vo, A., Gopalakrishnan, G., Kirby, R. M., and Thakur, R. Static-analysis assisted dynamic verification of MPI waitany programs. In Recent Advances in Parallel Virtual Machine and Message Passing Interface (EuroPVM/MPI) (2009), pp. 329–330.


[148] Vakkalanka, S., Vo, A., Gopalakrishnan, G., and Kirby, R. Precise dynamic analysis for slack elasticity: Adding buffer without adding bugs. In Recent Advances in the Message Passing Interface, 17th European MPI Users' Group Meeting, EuroMPI 2010 (Sept. 2010), vol. 6305 of LNCS.

[149] Vakkalanka, S., Vo, A., Gopalakrishnan, G., and Kirby, R. M. Reduced execution semantics of MPI: From theory to practice. In FM 2009 (Nov. 2009), pp. 724–740.

[150] Vo, A., Aananthakrishnan, S., Gopalakrishnan, G., de Supinski, B. R., Schulz, M., and Bronevetsky, G. A scalable and distributed dynamic formal verifier for MPI programs. In Supercomputing (SC) (2010). http://www.cs.utah.edu/fv/DAMPI/ and http://www.cs.utah.edu/fv/DAMPI/sc10.pdf.

[151] Vo, A., Gopalakrishnan, G., Kirby, R. M., de Supinski, B. R., Schulz, M., and Bronevetsky, G. Large scale verification of MPI programs using Lamport clocks with lazy update. In PACT (2011). Accepted.

[152] Vo, A., Vakkalanka, S., DeLisi, M., Gopalakrishnan, G., Kirby, R. M., and Thakur, R. Formal verification of practical MPI programs. In Principles and Practices of Parallel Programming (PPoPP) (2009), pp. 261–269.

[153] Vo, A., Vakkalanka, S., Williams, J., Gopalakrishnan, G., Kirby, R. M., and Thakur, R. Sound and efficient dynamic verification of MPI programs with probe non-determinism. In EuroPVM/MPI (Sept. 2009), pp. 271–281.

[154] Vuduc, R., Schulz, M., Quinlan, D., de Supinski, B., and Saebjornsen, A. Improved distributed memory applications testing by message perturbation. In Parallel and Distributed Systems: Testing and Debugging (PADTAD-IV) (2006).

[155] Wang, C., Chaudhuri, S., Gupta, A., and Yang, Y. Symbolic pruning of concurrent program executions. In ESEC/SIGSOFT FSE 2009 (2009), pp. 23–32.

[156] Download site for our XUM multicore design. http://www.cs.utah.edu/fv/XUM.

[157] Download site for our PUG tool (Prover of User GPU Programs). http://www.cs.utah.edu/fv/PUG.

[158] Yang, Y., Chen, X., Gopalakrishnan, G., and Kirby, R. M. Distributed dynamic partial order reduction based verification of threaded software. In SPIN (2007), D. Bosnacki and S. Edelkamp, Eds., vol. 4595 of Lecture Notes in Computer Science, Springer, pp. 58–75. Model Checking Software, 14th International SPIN Workshop, Berlin, Germany, July 1–3, 2007, Proceedings.

[159] Yang, Y., Chen, X., Gopalakrishnan, G., and Wang, C. Automatic detection of transition symmetry in multithreaded programs using dynamic analysis. In SPIN (June 2009), pp. 279–295.

[160] Young, B. D. MPI within a GPU. Master’s thesis, University of Kentucky, 2009.

[161] Yuichiro Ajima, Shinji Sumimoto, and Toshiyuki Shimizu. Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers. Computer 42, 11 (November 2009), 36–40.

[162] S.F. Ashby and J.M. May. Multiphysics simulations and petascale computing. In Petascale Computing: Algorithms and Applications. Chapman and Hall/CRC (Ed. D. Bader), 2007.

[163] S. Atlas, S. Banerjee, J.C. Cummings, P.J. Hinker, M. Srikant, J.V.W. Reynders, and M. Tholburn. A high-performance distributed simulation environment for scientific applications. In Proceedings of Supercomputing, 1995.

[164] A. H. Baker, R. D. Falgout, T. V. Kolev, and U. M. Yang. Scaling Hypre's multigrid solvers to 100,000 cores. To appear in High Performance Scientific Computing: Algorithms and Applications – A Tribute to Prof. Ahmed Sameh, M. Berry et al., eds., Springer. LLNL-JRNL-479591.


[165] M. Berzins, J. Luitjens, Q. Meng, T. Harman, C. A. Wight, and J. Peterson. Uintah: a scalable framework for hazard analysis. In Proc. of 2010 Teragrid Conf., July 2010, ACM.

[166] M. Berzins, Q. Meng, J. Schmidt, and J. Sutherland. DAG-based software frameworks for PDEs. In Proc. of HPSS 2011 (Euro-Par, Bordeaux, August 2011), Springer (to appear).

[167] A.D. Brydon, S.G. Bardenhagen, E.A. Miller, and G.T. Seidler. Simulation of the densification of real open-celled foam microstructures. J. Mech. Phys. Solids, 53:2638–2660, 2005.

[168] R.D. Falgout, J.E. Jones, and U.M. Yang. The Design and Implementation of Hypre, a Library of Parallel High Performance Preconditioners. Chapter in Numerical Solution of Partial Differential Equations on Parallel Computers, A.M. Bruaset and A. Tveito, eds., Springer-Verlag, 51 (2006), pp. 267–294. UCRL-JRNL-205459.

[169] B. Fryxell, K. Olson, P. Ricker, F. X. Timmes, M. Zingale, D. Q. Lamb, P. MacNeice, R. Rosner, J. W. Truran, and H. Tufo. FLASH: an adaptive mesh hydrodynamics code for modeling astrophysical thermonuclear flashes. The Astrophysical Journal Supplement Series 131:273–334, November 2000.

[170] J.E. Guilkey, T.B. Harman, and B. Banerjee. An Eulerian-Lagrangian approach for simulating explosions of energetic devices. Computers and Structures, 85:660–674, 2007.

[171] J.E. Guilkey, T.B. Harman, A. Xia, B.A. Kashiwa, and P.A. McMurtry. An Eulerian-Lagrangian approach for large deformation fluid-structure interaction problems, part 1: Algorithm development. In Fluid Structure Interaction II, Cadiz, Spain, 2003. WIT Press.

[172] J.E. Guilkey, J.B. Hoying, and J.A. Weiss. Modeling of multicellular constructs with the material point method. Journal of Biomechanics, 39:2074–2086, 2007.

[173] J. Guilkey, T. Harman, J. Luitjens, J. Schmidt, J. Thornock, J.D. de St. Germain, S. Shankar, J. Peterson, and C. Brownlee. Uintah User Guide Version 1.1. SCI Technical Report No. UUSCI-2009-007, SCI Institute, University of Utah, 2009.

[174] T.C. Henderson, P.A. McMurtry, P.G. Smith, G.A. Voth, C.A. Wight, and D.W. Pershing. Simulating accidental fires and explosions. Comp. Sci. Eng., 2:64–76, 1994.

[175] I. Ionescu, J.E. Guilkey, M. Berzins, R.M. Kirby, and J.A. Weiss. Simulation of soft tissue failure using the material point method. Journal of Biomechanical Engineering, 128:917–924, 2006.

[176] L. V. Kale, E. Bohm, C. L. Mendes, T. Wilmarth, and G. Zheng. Programming petascale applications with Charm++ and AMPI. Petascale Computing: Algorithms and Applications, 1:421–441, 2007.

[177] B.A. Kashiwa and E.S. Gaffney. Design basis for CFDLIB. Technical Report LA-UR-03-1295, Los Alamos National Laboratory, Los Alamos, 2003.

[178] B.A. Kashiwa, M.L. Lewis, and T.L. Wilson. Fluid-structure interaction modeling. Technical Report LA-13111-PR, Los Alamos National Laboratory, Los Alamos, 1996.

[179] G. Krishnamoorthy, S. Borodai, R. Rawat, J. P. Spinti, and P. J. Smith. Numerical modeling of radiative heat transfer in pool fire simulations. ASME International Mechanical Engineering Congress (IMECE), Orlando, Florida, November 2005.

[180] J. Luitjens. The Scalability of Parallel Adaptive Mesh Refinement Within Uintah. Ph.D. Dissertation, University of Utah, May 2011.

[181] J. Luitjens, M. Berzins, and T.C. Henderson. Parallel space-filling curve generation. Concurrency and Computation: Practice and Experience, 19:1387–1402, 2007.

[182] J. Luitjens, B. Worthen, M. Berzins, and T.C. Henderson. Scalable parallel AMR for the Uintah multi-physics code. In Petascale Computing: Algorithms and Applications. Chapman and Hall/CRC (Ed. D. Bader), 2007.


[183] J. Luitjens and M. Berzins. Improving the performance of Uintah: A large-scale adaptive meshing computational framework. In Proceedings of the 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS10), 2010.

[184] J. Luitjens and M. Berzins. Scalable parallel regridding algorithms for block-structured adaptive mesh refinement. Concurrency and Computation: Practice and Experience, 2011.

[185] B. O’Shea, G. Bryan, J. Bordner, M. Norman, T. Abel, R. Harkness, A. Kritsuk Introducing Enzo, anAMR Cosmology Application in ”Adaptive Mesh Refinement - Theory and Applications”, T. Plewa,T. Linde, V. G. Weirs (Eds.), Lecture Notes in Computational Science and Engineering, Vol. 41, pp.341-350, Springer 2005.

[186] Q. Meng, J. Luitjens, and M. Berzins. Dynamic task scheduling for the Uintah framework. InProceedings of the 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers(MTAGS10), 2010.

[187] Q. Meng, M. Berzins and J.Schmidt. Using hybrid parallelism to improve memory use in the Uintahframework. In Proceedings of the Teragrid 2011 Conference, ACM (2011) to appear.

[188] S.G. Parker. A component-based architecture for parallel multi-physics PDE simulation. FutureGeneration Comput. Sys. 2006; 22(1):204-216.

[189] R.Rawat, J. Spinti, W.Yee , P.J. Smith Parallelization of a Large Scale Hydrocarbon Pool Fire in theUintah PSE Paper no. IMECE2002-33105 pp. 49-55 ASME 2002 International Mechanical EngineeringCongress and Exposition (IMECE2002) November 1722, 2002 , New Orleans, Louisiana, USA HeatTransfer, Volume 6 ISBN: 0-7918-3637-1

[190] P. J. Smith, R.Rawat, J. Spinti, S. Kumar S. Borodai, A. Violi. Large eddy simulation of accidentalfires using massively parallel computers. AIAA-2003-3697, 18th AIAA Computational Fluid DynamicsConference. Orlando Florida, June 2003.

[191] Spinti, J., Thornock, J., Eddings, E., Smith, P., and Sarofim, A. Heat Transfer to Objects in PoolFires, in Transport Phenomena in Fires, WIT Press, Southampton, U.K., 2008.

[192] D. Sulsky, Z. Chen, and H.L. Schreyer. A particle method for history dependent materials. Comput.Methods Appl. Mech. Engrg., 118:179–196, 1994.

[193] Van Rij, J.; Ameel, T.; Harman, T. ”The Effect of Viscous Dissipation and Rarefaction on RectangularMicrochannel Convective Heat Transfer,” International Journal of Thermal Sciences, 48, 271-281,2009.

[194] J.H. Wise, T. Abel Enzo+Moray: Radiation Hydrodynamics Adaptive mesh Refinement Simulationswith Adaptive Ray Tracing. arXiv:1012.2865v2, 15th March 2011.

[195] A.M. Wissink, R.D. Hornung, S.R. Kohn, S.S. Smith, and N. Elliott. Large scale parallel struc-tured AMR calculations using the SAMRAI framework. In Supercomputing ’01: Proc. of the 2001ACM/IEEE Conference on Supercomputing, New York, USA, 2001.

Page D-11

Page 32: SHF: Medium: Formal Reliability Enhancement Methods for ... · of large simulation frameworks that solve complex multi-physics problems on hundreds of thousands of cores today and


E Project Personnel

Prof. Ganesh Gopalakrishnan, University of Utah
Prof. Martin Berzins, University of Utah
Mr. Alan Humphrey, University of Utah
New postdoctoral fellow (to be recruited)
Graduate students (to be recruited)


Data Management Plan

In this proposed project, data management issues will be addressed through the Scientific Computing and Imaging (SCI) Institute, as this is the home of the Uintah software (www.uintah.utah.edu) and its associated wikis and related project infrastructure.

Projects at the SCI Institute, such as Uintah, benefit from over ten years of software dissemination and data-housing experience. With respect to software dissemination, the Institute offers roughly twenty open-source software packages that are downloaded by tens of thousands of users yearly. The SCI website offers numerous project communication tools, including unrestricted and restricted wiki pages and data repositories. In addition, PI Gopalakrishnan's group maintains a Formal Methods webpage (www.cs.utah.edu/fv) and also highlights research achievements on the Center for Parallel Computing at Utah webpage (www.parallel.utah.edu).

Project Results

The main results of our research will be published in conferences and journals and will be made publicly available through the aforementioned websites. In addition, results from the proposed study will consist of software and software documentation.

Data Dissemination

It has been our goal at SCI to provide the online community with meaningful access to our research. In addition to publishing in journals and electronic venues, we provide full access to our conference publications, including preprints and technical reports released internally from SCI. Additionally, we will release research results as School of Computing technical reports.

A unique opportunity associated with the proposed work is that of releasing Uintah software versions whose components have been rigorously verified and which are also equipped with runtime monitoring facilities. Our code releases will be made under appropriate software licenses (e.g., the MIT open-source license, the Berkeley license, or the GPL).

Backups and Archive

SCI maintains three separate backup systems managed by a dedicated backup server. The backup server controls three tape libraries and is connected to the SCI infrastructure with 2x10Gb network connections. A three-drive LTO-3 tape library is used for archiving data for off-site storage, with slots to hold a 50TB (compressed) single-load tape rotation; all directories are fully archived once a quarter. A six-drive LTO-4 library is used for on-site data backup with a 417TB (compressed) single-load tape rotation; all directories are backed up fully once a quarter, with nightly incrementals and bi-weekly synthetic differentials. A single-drive LTO-4 tape library with a 12TB (compressed) single-load tape rotation capacity backs up infrastructure systems such as DNS, DHCP, mail, and accounting data. We will additionally employ the School of Computing backup service.

Data Backups of Critical Systems

SCI employs backup software for critical desktops, laptops, and servers. This software backs up systems to 16TB of disk on the central servers; data is then transferred to and stored on tape. In addition to scheduled backups, the software provides continuous, real-time protection by detecting data changes. Once the information stabilizes, it is stored on the central server or transmitted to a HIPAA-compliant data center in Minneapolis, Minnesota.

Data Privacy

Data access is primarily controlled with Unix groups, but for higher security levels, such as HIPAA-regulated or proprietary data, restrictions are set at the directory level. This allows host-level (IP-restricted) and network access to a directory, which can be further restricted to particular host computers or servers through a dedicated point-to-point network connection.

Physical access to the data center is limited to SCI Faculty and IT staff.


Facilities, Equipment and Other Resources

SCI Institute: The Scientific Computing and Imaging (SCI) Institute is one of eight recognized research institutes at the University of Utah and includes twelve faculty members and over 100 other scientists, administrative support staff, and graduate and undergraduate students. SCI has over 25,000 square feet of functional space allocated to its research and computing activities within the new John and Marva Warnock Engineering Building on campus. Laboratory facilities for the researchers listed in this project include personal work spaces and access to common working areas.

Computing: The SCI computing facility, which has dedicated machine room space in the Warnock Engineering Building, includes shared-memory multiprocessor computers, clusters, and dedicated graphics systems:

1. 264-core, 2.8TB shared-memory SGI UV 1000 system with Intel X7542 2.67GHz processors

2. 64-node GP-GPU cluster with Intel X5550 2.66GHz processors. Each node has 8 cores and 24GB of RAM and is connected to (32) NVIDIA S1070 Tesla GPU systems; nodes are linked with a 4x DDR InfiniBand backbone with dual 10Gb network connections to the SCI core switches.

3. 32-core, 192GB shared-memory IBM Linux system with Intel Xeon X7350 3.0GHz processors. This system can also be reconfigured into two separate 16-core systems with 96GB of RAM each.

4. 64-core, 512GB shared-memory HP DL980 G7 with Intel Xeon X7560 2.27GHz processors, 2x NVIDIA M2070 GPU cards, and 10Gb network connections

5. (3) 8-processor systems (24 cores, 2.5GHz AMD Opteron, with NVIDIA Quadro FX 5600 graphics card) with a dual Gigabit Ethernet backbone and 96GB RAM

6. (3) 2-processor systems (8 cores, 2.67GHz Intel X5550, with NVIDIA GeForce GTS 250 graphics card) with a Gigabit Ethernet backbone and 48GB RAM

7. 4-processor system (8 cores, 2.0GHz AMD Opteron, with NVIDIA Quadro 2FX graphics card) with a dual Gigabit Ethernet backbone and 16GB RAM

8. (2) 16-core systems with multiple NVIDIA C2070 Tesla cards

In addition, the SCI Institute computing facility contains:

1. An SGI InfiniteStorage system, consisting of both Fibre Channel and SATA disk, with a capacity of 80TB

2. An Isilon storage cluster with 13 36NL storage nodes of 36TB each, for a total of 422TB of usable space, with 6x dual 10 Gigabit Ethernet links and significant expansion capacity (up to 1PB in a single namespace)

3. Dedicated IBM backup server to manage the SAN backup system and SL500 robots, with 16TB of local SATA disk for disk-to-disk backups of critical systems

4. IBM 10TB LTO-4 tape library providing backup for infrastructure systems such as email, web, DNS, and administrative systems

5. 500TB LTO-4 StorageTek SL500 tape library serving as the primary backup system

6. 40TB LTO-3 StorageTek SL500 tape library for archiving and offsite storage

7. 2 fully redundant Foundry BigIron RX-16 switching cores that provide a Gigabit network backbone for all HPC computers, servers, and individual workstations connected via Foundry floor switches

8. Connections to the campus backbone via redundant 10 Gigabit Ethernet links, the first such attachments on campus


9. A variety of Intel- and AMD-based desktop workstations running Linux with the latest ATI or NVIDIA graphics cards

10. Numerous Windows XP and Windows 7 desktop workstations

11. Numerous Mac Pro workstations running OS X 10.6 with 30-inch displays

12. A 10-foot by 8-foot high-resolution rear-projection system with stereo capability for collaborative group interaction and code development

13. Six Sun quad-core AMD Opteron high-availability Linux servers providing core SCI IT services: website (www.sci.utah.edu), ftp, mail, and software distribution

14. Dedicated version control server with 6TB of local disk space for all SCI code and software projects

15. UPS power, including 100 minutes of battery backup for critical SCI servers

Office Space: The SCI Institute houses its faculty and staff in individual offices. Students have individual desk space, equipped with a workstation, located in large, open common areas that facilitate student collaboration and communication. All workstations are connected to the SCI local area network via full-duplex Gigabit Ethernet.

University of Utah CHPC: The University of Utah is a member of the Internet2 advanced networking consortium and is connected to the new Internet2 Network via Gigabit Ethernet (1 Gbit per second) initially. The University also connects to National LambdaRail (NLR) via a dedicated link to the Front Range Gigapop, and has the ability to extend dedicated wavelengths into the campus from the Internet2 and NLR nodes colocated in a Level 3 Communications facility west of downtown Salt Lake City.

The Center for High Performance Computing (CHPC) at the University of Utah is responsible for providing high-end computing services to advanced research groups working in the computational sciences and with simulations. CHPC has core competencies in operating large-scale computing resources, advanced networking, scientific computing, and simulations.

With support from the University, ICSE, and C-SAFE, CHPC has installed a cluster to enable the parallelization of the next generation of science and engineering codes. The cluster, named Updraft, is a capability cluster for large parallel jobs with 256 dual quad-core nodes (2048 cores in total), 2.8 GHz Intel Xeon (Harpertown) processors, and 16 Gbytes of memory per node (2 Gbytes per processor core). Updraft has a QLogic InfiniBand DDR (InfiniPath QLE 7240) interconnect and a Gigabit Ethernet interconnect. At present, 50% of Updraft is dedicated to ICSE and Uintah users. The primary use of Updraft is to prepare users for large TeraGrid machines and large DOE machines.

CHPC also runs the Ember cluster, a capability cluster for large parallel jobs with 262 dual-socket six-core nodes (3144 cores in total) using 2.8 GHz Intel Xeon (Westmere X5660) processors. The cluster has 24 Gbytes of memory per node (2 Gbytes per processor core), with a Mellanox QDR InfiniBand interconnect and a Gigabit Ethernet interconnect for management.

CHPC has recently announced a new 0.5PB HP-based storage system and is in the process of adding a GPU-based system in fall 2011.


Required Supplementary Documents

F Organizations Involved in the Project

The University of Utah


G Collaborators

Collaborator (Institution) Project Member

B. Athey (University of Michigan) Berzins
T. Ball (MSR) Gopalakrishnan
R. Beauwens (IMACS) Berzins
J. Bhadra (Freescale) Gopalakrishnan
A. Bunge (Colorado School of Mines) Berzins
G. Bronevetsky (LLNL) Gopalakrishnan
C-T. Chou (Intel) Gopalakrishnan
X. Chen (Amazon) Gopalakrishnan
S. Burkhardt (MSR) Gopalakrishnan
B. de Supinski (LLNL) Gopalakrishnan
P. Dew (University of Leeds, UK) Berzins
S. German (IBM) Gopalakrishnan
E. Farchi (IBM) Gopalakrishnan
C. Goodyear (University of Leeds, UK) Berzins
W. Gropp (University of Illinois) Gopalakrishnan
T. Hilbrich (TU Dresden) Gopalakrishnan
J. Holt (Freescale) Gopalakrishnan
M. Hubbard (University of Leeds, UK) Berzins
P. Jimack (University of Leeds, UK) Berzins
M. Leeser (Northeastern University) Gopalakrishnan
E. Lusk (Argonne National Laboratory) Gopalakrishnan
D. Lecomber (Allinea Inc.) Gopalakrishnan
G. Li (Samsung) Gopalakrishnan
H. Lu (University of Nanjing, China) Berzins
J.P. Luitjens (NVIDIA) Berzins
I. Melatti (University of Rome) Gopalakrishnan
E. Mercer (Brigham Young University) Gopalakrishnan
M. Mueller (TU Dresden) Gopalakrishnan
M. Musuvathi (MSR) Gopalakrishnan
J. W. O'Leary (Intel) Gopalakrishnan
R. Palmer (Microsoft) Gopalakrishnan
S. Qadeer (MSR) Gopalakrishnan
S. Parker (NVIDIA) Berzins
V. Sarkar (Rice University) Gopalakrishnan
M. Schulz (LLNL) Gopalakrishnan
S. Siegel (University of Delaware) Gopalakrishnan
P. Shirley (NVIDIA) Berzins
R. Thakur (Argonne National Laboratory) Gopalakrishnan
S. Vakkalanka (Microsoft) Gopalakrishnan
M. Walkley (University of Leeds, UK) Berzins
A. Wilson (University College, UK) Berzins
J. Wood (University of Leeds, UK) Berzins


H Letters of Commitment
