
HORIZON 2020 TOPIC FETHPC-02-2017

Transition to Exascale Computing

Exascale Programming Models for Heterogeneous Systems 801039

D 5.2

Report on Initial Porting of Applications to

Large-Scale Heterogeneous Systems

WP 5: Pilot Applications and Integration

Date of preparation (latest version): 31/5/2019

Copyright © 2018 – 2021 The EPiGRAM-HS Consortium

The opinions of the authors expressed in this document do not necessarily reflect the official opinion of the EPiGRAM-HS partners nor of the European Commission.


DOCUMENT INFORMATION

Deliverable Number     D 5.2
Deliverable Name       Report on Initial Porting of Applications to Large-Scale Heterogeneous Systems
Due Date               31/5/2019 (PM 9)
Deliverable Lead       Fraunhofer ITWM
Authors                Martin Kuhn (Fraunhofer ITWM), Valeria Bartsch (Fraunhofer ITWM), Olivier Marsden (ECMWF), Ioan Hadade (ECMWF), Vyacheslav Olshevsky (KTH), Sven Anderzen (KTH), Steven W.D. Chien (KTH), Niclas Jansson (KTH)
Responsible Author     Valeria Bartsch (Fraunhofer ITWM), e-mail: [email protected]
Keywords               Initial Porting, Large-scale heterogeneous systems
WP/Task                WP 5 / Task T5.2
Nature                 R
Dissemination Level    PU
Planned Date           31/5/2019
Final Version Date     31/5/2019
Reviewed by            Chaitanya Sishtla (KTH), Timo Schneider (ETH)
MGT Board Approval     YES


DOCUMENT HISTORY

Partner            Date        Comment                                                     Version
Fraunhofer ITWM    3/4/2019    First Draft                                                 0.1
ECMWF              23/4/2019   IFS part                                                    0.2
KTH                25/4/2019   NEK5000 part                                                0.3
KTH                26/4/2019   DL apps and iPIC3D                                          0.4
Fraunhofer ITWM    29/4/2019   GPI-ONNX                                                    0.5
KTH                30/4/2019   iPIC3D testing plan 2nd version                             0.6
Fraunhofer ITWM    2/5/2019    Exec summary, intro, EPiGRAM-HS prog. env., Summary added   0.7
Fraunhofer ITWM    27/5/2019   Response to reviewers                                       0.8
KTH                27/5/2019   Final version                                               1.0


Executive Summary

This deliverable reports on the initial porting of the applications to large-scale heterogeneous systems. The focus is on the parts of the applications that need to be changed to use the EPiGRAM-HS programming environment, which is under development in WP2, WP3 and WP4. The applications are preparing to use the new communication features developed by WP2 in the communication-intensive parts of the code. The memory abstraction layer developed in WP3 will be used by most applications that need to use data located on diverse memory systems. In WP4, the task-based workflow management system GPI-Space for parallel applications is being extended and a domain-specific language for artificial neural systems is being designed and implemented; the developments of WP4 are targeted towards the deep learning applications. All features of the EPiGRAM-HS programming environment will be tested by at least one application.

• The two DL applications (lung cancer detection and malware classification) will be the baseline for the comparison of the EPiGRAM-HS approaches towards deep learning in WP4 and WP5. In the scope of WP5, GPI-ONNX will be developed to provide a framework for the training of deep neural networks distributed over many compute entities. Overall, the deep learning applications will be evaluated with GPI-ONNX and TensorFlow in WP5, and with the deep learning framework built on top of GPI-Space and the domain-specific language for artificial neural networks in WP4.

• The weather forecast IFS code will evaluate persistent MPI collectives in the spherical transform dwarf, which repetitively uses the Fast Fourier Transform (FFT) and matrix-matrix multiplication (DGEMM). Fine-point and end-point MPI semantics will be used to be able to communicate with all threads in a thread-safe manner. Two-sided point-to-point MPI communication will be replaced by single-sided asynchronous GASPI communication. The physics dwarf will evaluate the memory abstraction framework of WP3.

• For the space weather application iPIC3D, a performance analysis has shown that the MPI reduction collectives currently take up a considerable share of the overall MPI time. Therefore, the MPI persistent collectives will be evaluated. The particle mover of the Particle-in-Cell code is the most computationally expensive part of the code; thus, the particle mover has been selected for the first porting to GPUs using CUDA. The next step will be to use the GPU-enabled MPI that is developed in WP2. The CUDA Unified Virtual Memory (UVM) is being tested in preparation for using the WP3 memory abstraction framework.

• The focus of the work on the computational fluid dynamics application Nek5000 will be on accelerator-based systems. As a first step, the Nek5000 application is being refactored to use dynamic memory allocation. The goal is to be able to use the memory abstraction framework of WP3. Once this change has been validated, testing of the algebraic multigrid solver on GPUs will be started. For the gather-scatter kernel, the use of GPU-enabled MPI communication and persistent collectives will be investigated.


Contents

1 Introduction
2 EPiGRAM-HS Programming Environment
3 Artificial Intelligence: Deep-Learning Applications
  3.1 Introduction
  3.2 Lung Cancer Detection
    3.2.1 Background
    3.2.2 Problem Description
    3.2.3 Methodology
  3.3 Malware Classification
    3.3.1 Background
    3.3.2 Problem Description
    3.3.3 Methodology
  3.4 Initial Porting Strategy of GPI-ONNX
    3.4.1 Layer Implementation
    3.4.2 ONNX Parsing
    3.4.3 ACDG Execution
    3.4.4 Communication Layer
  3.5 Testing Plan for GPI-ONNX
  3.6 Evaluation Metric
4 Weather Forecasting: IFS ESCAPE Dwarfs
  4.1 Introduction
  4.2 Initial Porting Strategy
    4.2.1 IFS ESCAPE Dwarfs
  4.3 Testing Plan
5 Space Weather: iPIC3D
  5.1 Introduction
  5.2 Porting and Testing Strategy
  5.3 Initial Porting
6 Computational Fluid Dynamics: Nek5000
  6.1 Introduction
  6.2 Porting and Testing Strategy
  6.3 Initial Porting
7 Summary and Future work


1 Introduction

This deliverable reports on the initial porting of applications to large-scale heterogeneous systems. All EPiGRAM-HS applications, namely the deep-learning applications, the weather forecast code (IFS), the space weather application (iPIC3D) and the computational fluid dynamics application (Nek5000), report on the initial porting of the application. The focus is on the parts of the applications that need to be changed to use the EPiGRAM-HS programming environment described in Section 2. All EPiGRAM-HS applications describe their test plan, which aims to show an improvement of the application performance.

2 EPiGRAM-HS Programming Environment

The EPiGRAM-HS programming environment is being built up by developments in WP2 ("Exploiting heterogeneity for high performance communication, building on proven programming models"), WP3 ("Efficient and simplified usage of diverse memories") and WP4 ("Computing with FPGAs, GPUs and low-power microprocessors"). It is currently in the design phase. For a detailed description of the EPiGRAM-HS programming environment please refer to the design documents D2.2, D3.2 and D4.2.

The main features of the EPiGRAM-HS programming environment are:

• WP2 targets the communication in heterogeneous HPC systems. In detail, the following features are designed, implemented or proposed:

– MPI persistent collectives, which will give applications with recurring collectives a performance advantage.

– MPI finepoints for a thread-safe handling of multithreaded applications.

– MPI communication for CUDA, which allows efficient communication with GPUs.

– GPI communication between FPGAs.

The applications are preparing to use the features developed by WP2 in the communication-intensive parts of the code.

• WP3 develops a memory abstraction framework that allows a simplified and efficient usage of diverse memory systems. Common native arrays have a fixed layout of array elements to memory locations. The Memory Abstraction Framework will allow the user to program to an array-like object that has a more flexible underlying layout across diverse memory systems. It consists of the program-level abstractions themselves, a heterogeneous memory manager, a cost model and a low-level abstraction API. The memory abstraction layer developed in WP3 will be used by most of the applications that need to use data located on diverse memory systems.


                                  DL      Weather          Space Weather    CFD
                                  Apps    Forecast (IFS)   (iPIC3D)         (Nek5000)
WP2   MPI comm. in general        X       X                X                X
      MPI persistent collectives          X                X                X
      MPI finepoints                      X
      MPI RMA comm. for CUDA              X                X                (X)
      general comm. with GPUs             X                X                X
      GPI comm. in general        X       X
      GPI comm. with FPGAs        X
WP3   memory abstraction fw               X                X                X
WP4   GPI-Space DL fw             X
      DSL for ANS                 X

Table 1: Summary of which features of the EPiGRAM-HS programming environment will be evaluated by which application.

• WP4 develops a domain specific language for artificial neural systems and extends the task-based workflow system GPI-Space to FPGA-enabled applications, which in turn can be used together with a deep learning framework (developed in the HPDLF project, funded by the German Federal Ministry of Education and Research, BMBF). Both frameworks are built for the DL applications and will be able to run with FPGA-enabled applications.

A summary of which features of the EPiGRAM-HS programming environment will be evaluated by which application is given in Table 1. One can see that the applications of WP5 will work together with WP2, WP3 and WP4, evaluating all parts of the EPiGRAM-HS programming environment.

The developments of WP2 can be clearly split into extensions of the Message Passing Interface (MPI) and of the Global Address Space Programming Interface (GASPI) with its only implementation GPI. Most applications are based either on GASPI/GPI or on MPI:

• MPI is used by iPIC3D, Nek5000 and IFS, allowing these applications to test the MPI extensions of WP2.

• GASPI/GPI is used by the DL frameworks of EPiGRAM-HS (the DL framework on top of GPI-Space (WP4) and GPI-ONNX (WP5)). IFS is also interested in developing a GPI backend for their communication library.

The following sections describe in more detail why the applications test certain parts of the EPiGRAM-HS programming environment and present first porting results towards the EPiGRAM-HS programming environment.


3 Artificial Intelligence: Deep-Learning Applications

3.1 Introduction

The EPiGRAM-HS project explores possible directions of deep learning framework developments in WP4 and WP5. It is the aim of the project to test the EPiGRAM-HS DL applications described in Sections 3.2 and 3.3 with all DL frameworks proposed in this project.

In WP5, we explore two possible directions of parallel deep learning frameworks, namely TensorFlow and GPI-ONNX. Both initiatives propose techniques to take advantage of heterogeneous systems for improving parts of deep learning workloads. More details on the initial design can be found in D5.1 "Report on application requirements and roadmap". In this deliverable we describe the initial porting of GPI-ONNX in Sections 3.4 and 3.5.

3.2 Lung Cancer Detection

3.2.1 Background

Lung cancer is the deadliest form of cancer worldwide and accounts for approximately 27% of cancer-related deaths in the United States [1]. The National Lung Screening Trial showed that three annual screening rounds of low-dose computed tomography (CT) for high-risk subjects reduced lung cancer mortality after 7 years by 20% in comparison to screening with chest radiography [2]. As a result of this trial, it is believed that lung screening programs using low-dose CT scans are likely to be implemented for high-risk subjects in countries outside the United States soon [3].

However, a major challenge with deploying these screening programs on a sizable scale is the large number of CT images that need to be analyzed by radiologists [3]. Considerable interest has therefore been directed towards developing computer algorithms to optimize the screening process.

3.2.2 Problem Description

The publicly open LUNA16 challenge (https://luna16.grand-challenge.org/) was created to focus on the large-scale evaluation of automatic nodule detection algorithms and is based on the publicly available LIDC/IDRI dataset [4]. The challenge is split into two separate tasks, nodule detection and false positive reduction.

Nodule Detection. Nodule detection is the process of segmenting out an area of interest in a raw CT scan with the goal of identifying locations of possible pulmonary nodules. This is a biomedical image segmentation task for which promising results have previously been achieved using the U-Net convolutional neural network architecture [5].

We will employ the U-Net convolutional neural network architecture as a baseline model for our segmentation. The predicted nodule locations will then be further analyzed in the false positive reduction task in order to determine if the segmented locations contain malignant or benign nodules.


False Positive Reduction. The false positive reduction task is the process of determining the probability of a nodule being malignant or benign given a set of candidate locations. Hence, this can be viewed as a classification task; given a set of nodule locations, the task is to determine which class a particular nodule belongs to. Convolutional neural networks have shown favorable results in similar areas [6] and will be used as a baseline model for our classification.

3.2.3 Methodology

The nodule detection and false positive reduction tasks will be combined into a complete model for malignant lung cancer detection, as depicted in Figure 1. Given that the model architecture only needs to be defined once, we are able to export it to the different model descriptions, targeting both deep learning frameworks that will be worked on within the EPiGRAM-HS project.

Figure 1: Overview of the lung cancer detection use case.

3.3 Malware Classification

3.3.1 Background

Malware (a portmanteau of malicious software) stands for any piece of software which is designed to harm a computer or network system or to cause its malfunction. The malware industry has established itself as a well-organized, well-funded market dedicated to evading and bypassing traditional security measures. Clients and enterprises using computers infected with malware can be harmed in many ways: from system and network disruptions to major leaks of classified information. For instance, the WannaCry ransomware attack was estimated to have affected more than 200,000 computers across 150 countries, with total damages ranging from hundreds of millions to billions of dollars.

3.3.2 Problem Description

The malware classification challenge was announced by Microsoft in 2015 (https://www.kaggle.com/c/malware-classification). The training data set consists of 21,736 samples of infected files with a total size of 184 GB (compressed into a 7z archive, the data is some 18 GB large). Each labelled example consists of a hexadecimal representation of the file's binary content and a metadata manifest generated by a disassembler, and is uniquely identified by a 20-character hash value.


Each file is infected with malware belonging to exactly one of nine families: Ramnit, Lollipop, Kelihos ver3, Vundo, Simda, Tracur, Kelihos ver1, Obfuscator.ACY and Gatak. The task is to develop the best mechanism for classifying files in the test set into their respective family affiliations. Besides serving in the Kaggle competition, the dataset has become a standard benchmark for research on modeling malware behavior [7].

Different classification techniques were used in the challenge and afterwards, including traditional fingerprint extraction approaches. However, of particular interest for us are the classification methods based on convolutional neural networks (CNNs). Interest in CNNs has been booming in recent years because of their widespread use for image classification and image recognition tasks. It has been found that certain families of malware exhibit distinct patterns when their binary code is represented as images [8] (see also Figure 2). Hence, all the powerful machinery developed for image classification can also be used for malware classification.

Figure 2: Malware images belonging to different families, from left to right: Lollipop, Obfuscator.ACY, and Tracur.

3.3.3 Methodology

We have implemented a simple CNN with two convolutional layers, followed by a max-pooling layer with dropout and two fully-connected layers. The network is implemented using the Keras framework with the TensorFlow backend. TensorFlow supports GPUs via CUDA, and no dedicated porting is needed. Distributed computation is also possible, but it needs a dedicated implementation. Various strategies for the conversion of binary sequences to grayscale images have been presented in the literature. We use an open-source Python implementation available on GitHub (https://github.com/de6f/ML-Malware-Research).


The routine estimates the output image size based on the length of the binary sample: the larger the sample, the larger the image width. Then it reshapes the vector of bytes read from the binary file into a grayscale PNG image. Three examples from the MS Malware Classification data set are shown in Fig. 2.
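For illustration, the following C++ sketch mirrors the idea of the conversion routine described above (the referenced tool itself is written in Python); the width heuristic, function names and thresholds are assumptions, and writing the resulting matrix out as a PNG is left to an image library.

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Pick an image width from the file size: larger samples give wider images.
// The thresholds are illustrative only.
static std::size_t image_width(std::size_t file_size) {
    if (file_size < 10 * 1024)   return 32;
    if (file_size < 100 * 1024)  return 256;
    if (file_size < 1024 * 1024) return 512;
    return 1024;
}

// Read a binary sample and reshape its bytes into a width x height grayscale
// matrix (one byte = one grey pixel).
std::vector<std::vector<std::uint8_t>> binary_to_grayscale(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<std::uint8_t> bytes((std::istreambuf_iterator<char>(in)),
                                    std::istreambuf_iterator<char>());
    const std::size_t width  = image_width(bytes.size());
    const std::size_t height = (bytes.size() + width - 1) / width;
    std::vector<std::vector<std::uint8_t>> image(height,
                                                 std::vector<std::uint8_t>(width, 0));
    for (std::size_t i = 0; i < bytes.size(); ++i)
        image[i / width][i % width] = bytes[i];
    return image;
}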

3.4 Initial Porting Strategy of GPI-ONNX

The goal of the GPI-ONNX DL framework is to implement a scaling application for the training of Deep Neural Networks (DNNs). The implementation will be able to train a few common DNNs distributed over many distributed-memory compute entities. The communication between the compute entities will be handled efficiently using the GPI library.

Computation patterns in DL are highly homogeneous and are therefore good candidates for hardware acceleration. GPUs, which are nowadays quasi-standard in DL applications, provide specific hardware extensions for deep learning and are more energy efficient than CPUs. The main hardware platform targeted is therefore the GPU.

With GPI-ONNX we concentrate on Convolutional Neural Networks (CNNs), which are popular for image classification and image segmentation. Applications of CNNs include fraud identification, autonomous driving and healthcare, amongst others.

Previous work [9] extending the Deep Learning framework Caffe with GPI communication routines has shown promising preliminary improvements in scalability. Our GPI-ONNX implementation will use the Open Neural Network Exchange (ONNX) format as a front end. The ONNX project provides converters between popular Deep Learning frameworks (e.g. TensorFlow) and ONNX. The back end will implement the training of DNNs based on the DNN libraries of the hardware vendors, e.g. NVIDIA cuDNN for GPUs or the Intel MKL library for CPUs. The advantage of this approach is a higher flexibility: the framework can react to hardware changes by replacing the standard DNN library, and it can respond to upcoming DNN formats by converting them to ONNX.

The GPI-ONNX implementation will use some parts of the aforementioned HPDLF project, which is funded by the German ministry of education and research (BMBF), as depicted in Figure 3. The compiler constitutes a scanner/parser of an ONNX file, enabling the execution of existing DNN models and allowing portability in the future. Using information about the hardware and guidance from a performance model, the main goal of the compiler is to generate an optimized DNN compute graph. This will be done via an intermediate format called ONNX++, which augments the DNN description with hardware, execution and training information. The resulting information can then either be used by GPI-Space (as it is done in WP4) or by GPI in the scope of the GPI-ONNX development. WP5 provides a reference benchmark for the GPI-ONNX implementation approach. The advantage of the GPI-ONNX implementation of WP5 is full control over the data flow and the data locality.

The implementation will consist of several components: the layer implementation, the ONNX parsing, the execution of the ACyclic Directed Graph (ACDG), and the communication layer, which we describe now.


Figure 3: How WP5 is connected with HPDLF. The orange circle depicts work in WP5; green and blue depict HPDLF.

3.4.1 Layer Implementation

The layer implementation is the basic building block to create the Deep Learning nets. Some basic layers will be implemented that are needed to train at least some standard CNNs such as ResNet. The implementation will provide a common interface of input buffers and output buffers such that the layers can be connected by the ACDG execution engine; a possible interface is sketched after the list. The implementation will at least entail layers for

• Convolution 3x3, 1x1, with/without stride

• Maxpool pooling layer

• ReLU rectifier

• Softmax classifier
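As a rough illustration of such a common layer interface, the following C++ sketch shows layers exchanging data through plain input and output buffers; all type and class names (Buffer, Layer, ReluLayer) are illustrative and not taken from the GPI-ONNX code base.

#include <algorithm>
#include <cstddef>
#include <vector>

using Buffer = std::vector<float>;

class Layer {
public:
    virtual ~Layer() = default;
    // Forward pass: read activations from `in`, write results to `out`.
    virtual void forward(const Buffer& in, Buffer& out) = 0;
    // Backward pass: propagate gradients towards the inputs.
    virtual void backward(const Buffer& grad_out, Buffer& grad_in) = 0;
};

class ReluLayer : public Layer {
public:
    void forward(const Buffer& in, Buffer& out) override {
        out.resize(in.size());
        std::transform(in.begin(), in.end(), out.begin(),
                       [](float x) { return x > 0.0f ? x : 0.0f; });
        last_input_ = in;                       // kept for the backward pass
    }
    void backward(const Buffer& grad_out, Buffer& grad_in) override {
        grad_in.resize(grad_out.size());
        for (std::size_t i = 0; i < grad_out.size(); ++i)
            grad_in[i] = last_input_[i] > 0.0f ? grad_out[i] : 0.0f;
    }
private:
    Buffer last_input_;
};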

3.4.2 ONNX Parsing

The parsing of the ONNX input file is part of the HPDLF project. The output will be an ACDG describing the data flow through the layers of the net. Additional information such as paths to the training data, the learning rate, etc., has to be extracted. Until the compiler is finished, annotations are done manually on selected test examples to perform the benchmarks in this project.

3.4.3 ACDG Execution

The ACDG execution engine puts the pieces of the implementation together, connecting the different parts. Its tasks are (a sketch of the graph linearization follows the list):

• invocation of the ONNX parser

Page 13: Exascale Programming Models for Heterogeneous Systems 801039 · Exascale Programming Models for Heterogeneous Systems 801039 D 5.2 ... use the EPiGRAM-HS programming environment.

D 5.2: Report on Initial Porting of Applications to Large-Scale Heterogeneous Systems13

• linearization of the ACDG to create an execution order

• file IO for training data, labels, etc.

• allocation of storage buffers and communication buffers

• execution of layer propagation forward/backward

• invocation of the communication layer
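The linearization step listed above can be realized as a topological sort over the layer nodes of the acyclic graph. The following C++ sketch (Kahn's algorithm) only illustrates the idea; the node numbering and adjacency representation are assumptions, not the GPI-ONNX data structures.

#include <cstddef>
#include <queue>
#include <stdexcept>
#include <vector>

// adj[i] lists the layers that consume the output of layer i.
std::vector<std::size_t> linearize(const std::vector<std::vector<std::size_t>>& adj) {
    std::vector<std::size_t> in_degree(adj.size(), 0);
    for (const auto& succs : adj)
        for (std::size_t s : succs) ++in_degree[s];

    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < adj.size(); ++i)
        if (in_degree[i] == 0) ready.push(i);   // inputs of the net

    std::vector<std::size_t> order;
    while (!ready.empty()) {
        std::size_t node = ready.front();
        ready.pop();
        order.push_back(node);                  // execute this layer next
        for (std::size_t s : adj[node])
            if (--in_degree[s] == 0) ready.push(s);
    }
    if (order.size() != adj.size())
        throw std::runtime_error("graph contains a cycle; not a valid ACDG");
    return order;
}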

3.4.4 Communication Layer

The GPI interface will be the base for the implementation of the communication layer. It provides a simple interface for a straightforward overlap of communication and computation. This overlap is important for good scalability. The notification mechanism of GPI is a cost-efficient instrument to signal data dependencies between different compute entities. Using this specific point-to-point synchronization will increase flexibility and will avoid global barrier synchronizations.
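The following C++ sketch illustrates, under simplifying assumptions (segment size, offsets, neighbour choice and error handling are placeholders), how the GPI-2 notification calls gaspi_write_notify and gaspi_notify_waitsome can be used for this kind of point-to-point synchronization with overlap; it is not GPI-ONNX code.

#include <GASPI.h>
#include <cstdlib>

#define SUCCESS_OR_DIE(stmt) do { if ((stmt) != GASPI_SUCCESS) std::abort(); } while (0)

int main() {
    SUCCESS_OR_DIE(gaspi_proc_init(GASPI_BLOCK));

    gaspi_rank_t rank, nranks;
    SUCCESS_OR_DIE(gaspi_proc_rank(&rank));
    SUCCESS_OR_DIE(gaspi_proc_num(&nranks));

    // One RDMA segment holding, e.g., a block of gradients.
    const gaspi_segment_id_t seg = 0;
    const gaspi_size_t seg_size = 1 << 20;
    SUCCESS_OR_DIE(gaspi_segment_create(seg, seg_size, GASPI_GROUP_ALL,
                                        GASPI_BLOCK, GASPI_ALLOC_DEFAULT));

    const gaspi_rank_t neighbour = (rank + 1) % nranks;
    const gaspi_notification_id_t notif = 0;

    // Write a block to the neighbour and attach a notification; the call
    // returns immediately, so local computation can proceed (overlap).
    SUCCESS_OR_DIE(gaspi_write_notify(seg, 0, neighbour, seg, 0, seg_size / 2,
                                      notif, 1, 0, GASPI_BLOCK));

    // ... compute on other data here ...

    // Wait only for the one incoming notification, i.e. a point-to-point
    // synchronization instead of a global barrier.
    gaspi_notification_id_t first;
    SUCCESS_OR_DIE(gaspi_notify_waitsome(seg, notif, 1, &first, GASPI_BLOCK));
    gaspi_notification_t value;
    SUCCESS_OR_DIE(gaspi_notify_reset(seg, first, &value));

    SUCCESS_OR_DIE(gaspi_proc_term(GASPI_BLOCK));
    return 0;
}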

3.5 Testing Plan for GPI-ONNX

Training of DNNs is a stochastic process: the initial weights and the order of the training data are chosen randomly, and dropout adds to the stochastic nature of DNN training. An exact comparison of results is therefore difficult, especially where bit-identical correctness is concerned. We believe that such a high level of precision and correctness of each operation is not necessary to ensure an adequate testing scenario. However, to ensure a general level of correctness, the different parts of the implementation will be tested in isolated tests before the modules are put together.

• Layer testing: A unit test is built for each layer implementation, which executes on artificial test data. The results are compared to the results of a naive implementation. This will ensure that the library parameters are set correctly.

• ACDG Execution Engine Testing: The layer implementations are replaced by simple mathematical operations, such as multiplication by a constant factor, so that the expected result is more predictable and can be verified more easily.

• Communication Layer Testing: The communication layer is tested separately with simplified test data so that the correctness of the results can be verified more easily. Additionally, the communication layer is tested together with the ACDG execution engine but with dummy layer implementations to make the expected output more predictable. This simplifies the debugging of implementation errors.

The GPI-ONNX implementation will be benchmarked for scalability, which is a major result of this work package. The preferred hardware target to benchmark on is a GPU cluster. The communication layer is benchmarked separately for scalability at different block sizes to evaluate the performance of the communication scheme independently from the computational performance and computational jitter.


3.6 Evaluation Metric

Given the trend of growing neural network sizes and more readily available data, efficient deep learning models take longer to train than before. Through the use of the programming models for heterogeneous hardware developed within the EPiGRAM-HS project, the goal of this application use case is to shorten the time needed to train the above neural network setup while achieving comparable results with regard to segmentation and classification accuracy. After the artificial benchmarks have been performed, the full implementation is benchmarked for scalability. The neural network will be a convolutional neural network (CNN).

4 Weather Forecasting: IFS ESCAPE Dwarfs

4.1 Introduction

The IFS software is a major European resource representing many years of investment by the ECMWF's Member States. It is used to perform operational medium-range global weather forecasts twice daily at ECMWF. The IFS software has been developed for more than 30 years and has been run on various hardware architectures, such as shared and distributed memory machines and vector and scalar multi-core processors. The IFS is a large and mature software package (over 3 million lines of Fortran code), and the development to its current status has required in excess of 100 person years. In order to facilitate future developments and optimization strategies for the IFS, be they algorithmic, programming or hardware-related, computational dwarfs, representing individual components of the overall NWP process, have been developed during the EU-funded ESCAPE H2020 project [10]. These different forecast components also have different computational characteristics and memory usage patterns, illustrating a wide range of generic problem types.

4.2 Initial Porting Strategy

The ESCAPE dwarfs will be used to showcase some of the benefits that the EPiGRAM-HS programming environment developments will bring for targeting future heterogeneous exascale HPC machines. In particular, WP2 and WP3 developments will be utilized. Best practices, as described by the other work packages, will be followed in the usage of EPiGRAM-HS programming environment features for the ESCAPE dwarfs. On the WP2 side, this will include persistent MPI collective communications, MPI RMA communications with GPUs, GPI communications and possibly MPI fine-points. As for WP3, an implementation of the proposed memory abstraction and tiling mechanism will be performed in one of the physics IFS dwarfs.

The potential for a heterogeneous execution of the IFS on CPUs with a co-model (e.g., radiation, ocean, wave) running concurrently on an accelerator will also be investigated. The aim is to achieve this by making use of some of the functionality provided by the EPiGRAM-HS programming environment. Consequently, we expect to use both CPUs and GPUs as hardware test beds for our work.


The initial porting process for the relevant ESCAPE dwarfs will consist of enabling application characteristics equivalent to those provided by the aforementioned EPiGRAM-HS programming environment developments, with current and existing APIs and implementations (i.e., without making use of any EPiGRAM-HS programming environment developments). In particular, this will entail:

• implementing persistent MPI communication semantics in the Spherical Transforms dwarf

• developing a GPI backend for the communications library used in the IFS and theESCAPE dwarfs

• enabling hybrid CPU(IFS) + GPU(dwarf) execution with current communicationsemantics

• preparing a physics dwarf suitable for the implementation of the memory abstraction framework to be developed by WP3. This should allow abstraction of the data layout, as well as reduce the level of hand-coded optimizations in the code.

4.2.1 IFS ESCAPE Dwarfs

Spectral Transform Dwarf The IFS is based on the spectral transform method, which involves discrete spherical harmonics transformations between physical (grid-point) space and spectral (spherical harmonics) space. These transformations are implemented as a Fourier transform in longitude and a Legendre transform in latitude. The Fourier transform is performed using the fast Fourier transform (FFT) and the Legendre transform via a matrix-matrix multiplication (DGEMM). The global communications involved in the spherical transforms are widely considered to be a bottleneck for spectral-based weather forecasting suites at extreme scale. Work carried out in the CRESTA project on communication-computation overlap based on co-array Fortran demonstrated significantly enhanced scalability on the TITAN machine [11].

The dwarf is implemented as a timed loop that performs repeated transformations from spectral to grid-point space and back. The transforms are implemented in the trans library, which both the dwarf and the main IFS are linked against. Consequently, changes in the trans library via the dwarf can also be included in the main IFS if needed.

Listing 1: Spectral Transform dwarf timed loop with inverse and direct transforms.

DO JSTEP=1,ITERS
  ZTSTEP=TIMEF()
  CALL INV_TRANS(PSP=ZSPEC(1:IFLDS,:),PGP=ZG)
  CALL DIR_TRANS(PSP=ZSPEC(1:IFLDS,:),PGP=ZG)
  IF (MYPROC == 1) THEN
    ZTSTEP=(TIMEF()-ZTSTEP)/1000.0d0
  ENDIF
ENDDO

At horizontal grid spacings finer than approximately 1 km, these transformations utilize around 50% of the whole execution time, 75% of which is spent in the global communications required for transposing data between grid-point, Fourier and spectral representations.


Increasing the overlap of computations and communications in the trans library could therefore lead to significant performance improvements at the resolutions targeted for implementation on future machines.

Persistent communications are a natural fit for the spectral transforms, due to their repetitive, invariant nature. Because of the hybrid execution mode of the IFS (MPI + OpenMP), multiple threads are available to each rank in the trans library. However, currently MPI calls are funneled, restricting message construction and handling to a single thread. Consequently, either fine-point or end-point MPI semantics should allow multi-threading improvements to library performance.

Semi-Lagrangian Advection Dwarf A key component of any NWP dynamical core is the advection scheme. Its purpose is to solve the PDEs modeling the transport of momentum, heat and mass on a spherical domain. The semi-Lagrangian (SL) method is a very efficient technique for solving such transport equations, mainly because of its unconditional stability and good dispersion properties, which permit accurate integration using long time steps. However, it is known that due to communication overheads the efficiency of the SL method reduces as resolution increases towards cloud-resolving scales and computer architectures move towards exascale platforms.

Communications are currently implemented with standard two-sided point-to-point calls. Due to the nature of the SL algorithm, which exhibits dynamic and sparse communication properties, one-sided semantics are potentially well-suited to expressing the algorithm's requirements more efficiently. Use of the one-sided API proposed by GPI will be investigated for this dwarf, and compared with that from the MPI standard.

Physics Dwarf Numerical weather prediction combines the large-scale dynamical evolution of the atmosphere, governed by the Euler equations, with a variety of physical processes that occur at much finer spatial and time scales and that therefore require parameterization. The horizontal length scales involved in physical processes are at least one order of magnitude smaller than current grid resolutions, meaning that different atmospheric vertical columns can be considered independently. Physical processes therefore do not involve communication. However, the number of processes requiring modeling, as well as their numerical complexity, leads to physical processes making up approximately half of the cost of a high-resolution forecast.

Efficient access to field data in memory and optimized looping over vertical columns are prerequisites for good performance on current and future HPC architectures. Currently these goals can only be achieved by writing hand-optimized code, which requires a significant effort to port to new systems and architectures. For these reasons, the physical parameterization dwarf will be used to explore the tiling abstraction proposed by WP3. It is hoped that this will lead to good performance with increased code portability.

4.3 Testing Plan

Prior to any testing of performance improvements from code developments, correctness of results must be maintained. The ESCAPE dwarfs can all be run in a bit-identical fashion, meaning that results between two runs should be identical down to the final bit of information. This allows easy checking that code modifications have not changed the results.



Improvements stemming from EPiGRAM-HS programming environment usage will fall into two categories: firstly, semantics improvements and, secondly, performance improvements. Performance metrics of interest include weak and strong scaling, and single-task time to solution.

The EPiGRAM-HS programming environment aspires to make large-scale heterogeneous systems easier to target for scientific HPC applications. This might mean, for example, better semantics for communicating with accelerators, or easier specification of heterogeneous memory usage. While better code semantics is to an extent a subjective notion, EPiGRAM-HS programming environment developments specifically aimed at facilitating the targeting of future platforms should provide clear improvements.

Performance improvements will be assessed with care. A well-documented baseline study will be the starting point for each of the dwarfs to be used. Baseline scaling will be exhibited for the spectral transform and semi-Lagrangian dwarfs, while single-node computational performance will be analysed for the physics dwarf. The performance of the dwarfs modified to make use of EPiGRAM-HS programming environment developments will be compared in detail to these baselines.

5 Space Weather: iPIC3D

5.1 Introduction

iPIC3D is a Particle-in-Cell (PIC) code that performs large-scale kinetic simulations of space plasmas on supercomputers [12]. The code is widely used by the space physics community for its advanced algorithms and scalability. Example applications include the study of magnetic reconnection in Earth's magnetotail [13] and dayside magnetopause, kinetic turbulence [14] and the interaction of the solar wind with Earth's magnetosphere [15, 16]. The code is part of the multi-physics Space Weather Modeling Framework (SWMF), which is used for the simulation of space weather [17], but it also works as a stand-alone application.

iPIC3D is developed in C++ and uses MPI for inter-process communication. The code is highly scalable and achieved 80% parallel efficiency on one million MPI processes on BG/Q [18]. iPIC3D currently does not support heterogeneous systems. The goal of this porting is to enable iPIC3D to use accelerators and heterogeneous memory systems. The initial porting will enable and prepare the future porting of iPIC3D to the EPiGRAM-HS programming environment.

The PIC method is a numerical method that consists of three steps. A simulation is first initialized by setting up the simulation environment. This includes the initial particle positions and velocities and the values of the electric and magnetic fields. After the initialization the code enters a computation cycle, which is repeated until the end of the simulation.

iPIC3D uses domain decomposition, where the box-shaped simulation domain is decomposed into subdomains assigned to MPI processes. Each process repeats a simulation cycle which consists of three main steps:



1. Field Solver. Electric and magnetic fields are computed on a grid by a Finite Difference Time Domain (FDTD) solution of Maxwell's equations. The linear system is solved with a GMRES solver to compute the different quantities on the grid. Computed values are communicated to neighbors via ghost-cell exchange.

2. Particle Mover. New particle positions and velocities are computed using the electric and magnetic field values interpolated to the particle positions. Particle advancement is done using a predictor-corrector scheme. After all the particles are advanced, particles that moved outside of the subdomain are communicated to neighbor processes. A typical simulation can involve up to billions of particles.

3. Moments Calculation. The particle moments of the distribution function, such as density, current and pressure, are calculated on the grid by interpolation. The values are communicated to the neighbors by ghost-cell exchange.

Different data structures are used for storing the subdomains in the MPI processes. Field data are represented by 3D arrays, and particles are represented by an array of structures. A structure represents one particle and contains information such as its position and velocity.

5.2 Porting and Testing Strategy

In this section, we describe the characteristics of iPIC3D and the initial porting plan of iPIC3D to use the facilities provided by the EPiGRAM-HS environment. We plan to use the CPU version to establish a baseline. To understand the performance of iPIC3D in terms of large-scale simulation, we performed a profiling run on Beskow, a Cray XC40 system at KTH PDC (https://www.pdc.kth.se/hpc-services/computing-systems/beskow-1.737436). The simulation depicts a realistic case using 4096 processes on 128 nodes. Figure 4 shows one snapshot of the simulation. The simulation involves billions of particles and is executed for 100 cycles.

We used Arm MAP (https://www.arm.com/products/development-tools/server-and-hpc/forge/map) to perform the profiling. Through the profiling, information such as the computation in different stages of the code, the memory usage and the communication pattern can be obtained.

Table 2: Percentage of time used by the different computation stages during the simulation.

                        Core time    MPI time
Particle mover          45%          17.3%
Field solver            26.7%        5.7%
Moments calculation     26.4%        3.6%

The profiling results are summarized in Table 2. The results show that the particle mover stage accounts for both the highest computational core time and the highest MPI time.


Figure 4: Problem used for profiling.

Figure 5: Decomposition of core time spent on iPIC3D.

Upon further examination, it can be seen from Figure 5 that 23.6% of the core time is spent on the particle update and 20.4% of the core time is spent on MPI-related activities where particles are communicated to neighbors. The particle mover code is approximately 150 lines long and is computationally intensive in the sense that each process has to iterate through all particles in its subdomain.

The communication pattern of iPIC3D is shown in Figure 6. iPIC3D uses point-to-point communications such as MPI_Isend() and MPI_Irecv() to exchange information between neighbors, and collectives such as MPI_Allreduce() for reductions in the GMRES solver. In the case of the particle update, MPI_Allreduce() is used to account for the number of particles that have not yet been communicated.
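For reference, the following C++ sketch reproduces this baseline communication pattern in simplified form (neighbour lists and buffer contents are placeholders, not iPIC3D code): non-blocking sends and receives to the neighbours followed by an MPI_Allreduce for global bookkeeping.

#include <cstddef>
#include <mpi.h>
#include <vector>

// Exchange ghost data with each neighbour using non-blocking point-to-point
// calls, then perform the global reduction used for particle bookkeeping.
void exchange_with_neighbours(MPI_Comm comm,
                              const std::vector<int>& neighbours,
                              std::vector<std::vector<double>>& send,
                              std::vector<std::vector<double>>& recv) {
    std::vector<MPI_Request> reqs;
    reqs.reserve(2 * neighbours.size());
    for (std::size_t i = 0; i < neighbours.size(); ++i) {
        MPI_Request r;
        MPI_Irecv(recv[i].data(), static_cast<int>(recv[i].size()), MPI_DOUBLE,
                  neighbours[i], 0, comm, &r);
        reqs.push_back(r);
        MPI_Isend(send[i].data(), static_cast<int>(send[i].size()), MPI_DOUBLE,
                  neighbours[i], 0, comm, &r);
        reqs.push_back(r);
    }
    MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);

    // Global check, e.g. how many particles still have to be communicated.
    long local_remaining = 0, global_remaining = 0;
    MPI_Allreduce(&local_remaining, &global_remaining, 1, MPI_LONG, MPI_SUM, comm);
}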

iPIC3D is also a memory-intensive application due to the large number of particles involved in the simulation. Figure 7 shows a time series of the memory usage of the application during a simulation. It can be seen that each MPI process uses approximately 400 MB, which in total accounts for 26% of the node memory.


Figure 6: MPI calls by iPIC3D.

Figure 7: Memory usage of iPIC3D during a simulation.

5.3 Initial Porting

The MPI profiling results from Section 5.2 reveal that, apart from MPI_Isend() and MPI_Irecv(), MPI_Allreduce() takes up considerable MPI time. A persistent collective version of MPI_Allreduce() will likely benefit both the linear solver and the particle communicator in iPIC3D. The implementation can be tested by profiling the MPI performance of the application.
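The sketch below shows the intended usage pattern with the persistent-collective interface as standardized in MPI 4.0 (MPI_Allreduce_init); the actual interface delivered by WP2 may differ, so this is an illustration rather than the planned implementation.

#include <mpi.h>

// Placeholder for the local contribution computed by the solver in each cycle.
static double compute_local_contribution() { return 1.0; }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    double local_dot = 0.0, global_dot = 0.0;
    MPI_Request req;

    // Set up the reduction once: buffers, datatype and communicator stay fixed.
    MPI_Allreduce_init(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);

    for (int cycle = 0; cycle < 100; ++cycle) {
        local_dot = compute_local_contribution();
        MPI_Start(&req);                    // start the prepared collective
        MPI_Wait(&req, MPI_STATUS_IGNORE);  // global_dot now holds the global sum
    }

    MPI_Request_free(&req);
    MPI_Finalize();
    return 0;
}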

The profiling results from Section 5.2 show that the particle mover is the most computationally expensive part of the application. For this reason, we focus on the porting and optimization of the iPIC3D particle mover.

Table 3: Execution time of the particle mover.

Type of node                     CPU (s)    GPU (s)
Tegner (Haswell + K80)           15.33      2.44
Kebnekaise (Broadwell + K80)     15.20      2.87
Kebnekaise (Skylake + V100)      36.82      1.43

A proof-of-concept porting of the iPIC3D particle mover to NVIDIA CUDA has been implemented in [19].


The porting uses the complete offload pattern, where the particle mover is completely executed on the device. Data required by the kernel are copied to the device before the kernel launch and results are copied back to the host before proceeding to the next stage. The GPU particle mover kernel uses the same code as the CPU version with only slight modifications, and no GPU-specific optimization is implemented. Simple textbook techniques, such as the use of pinned memory and CUDA streams, are used to improve data transfer performance.
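The host-side structure of such a complete-offload step is sketched below; buffer sizes are illustrative, the kernel launch is only indicated by a comment, and the code is not taken from the port described in [19].

#include <cstddef>
#include <cuda_runtime.h>

void move_particles_offload(std::size_t n_particles) {
    const std::size_t bytes = n_particles * 6 * sizeof(double); // x,y,z,u,v,w

    double* h_particles = nullptr;   // pinned host buffer: faster async copies
    cudaHostAlloc(reinterpret_cast<void**>(&h_particles), bytes, cudaHostAllocDefault);
    double* d_particles = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&d_particles), bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // ... fill h_particles from the application data structures ...

    cudaMemcpyAsync(d_particles, h_particles, bytes,
                    cudaMemcpyHostToDevice, stream);

    // <<< particle mover kernel launch on `stream` would go here >>>

    cudaMemcpyAsync(h_particles, d_particles, bytes,
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);   // results are back before the next stage

    cudaStreamDestroy(stream);
    cudaFree(d_particles);
    cudaFreeHost(h_particles);
}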

By merely using simple porting techniques, a substantial improvement can be observed for the GPU version in Table 3. This suggests that the particle mover is indeed a suitable candidate for GPU execution. The testing environment of the GPU particle mover can be found in [19].

As a next step in the GPU mover development, the GPU-enabled MPI from WP2 can be used to improve kernel execution and communication performance. Since the development of GPU-enabled MPI is still in its early stage, functionalities and interfaces are likely subject to change. Furthermore, the use of GPU-enabled MPI may require a complete porting of the entire application to the GPU. For this reason, a simple stand-alone particle mover that resembles the functionality of the iPIC3D particle mover can be developed first. This allows the study of functionality and performance before planning its incorporation into iPIC3D in the long run. Currently, particles in iPIC3D are represented as an array of structures where each structure contains the information of one particle. One potential improvement that will benefit GPU execution is the use of a structure of arrays (see the sketch below). This adaptation will result in major code changes. The use of a stand-alone particle mover will allow a better understanding of its implications before porting in the long run. The porting can be tested using the CPU version as a performance baseline.
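The following sketch contrasts the two particle layouts; the field names are illustrative and do not correspond to the actual iPIC3D declarations.

#include <vector>

// Current layout: one structure per particle (array of structures, AoS).
struct Particle {
    double x, y, z;    // position
    double u, v, w;    // velocity
    double q;          // charge
};
using ParticlesAoS = std::vector<Particle>;

// Candidate layout: one array per particle attribute (structure of arrays,
// SoA), so that consecutive GPU threads touch consecutive memory locations.
struct ParticlesSoA {
    std::vector<double> x, y, z;
    std::vector<double> u, v, w;
    std::vector<double> q;
};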

In connection with the previous work in [19], CUDA unified memory can be used to eliminate the need for explicit management of data transfers. A limited form of Unified Virtual Memory (UVM) was first introduced by NVIDIA in CUDA 6 for GPUs with compute capability 3.0. UVM presents a single address space across the host and device memory systems. This means that a single pointer can be used both by host and device functions. Furthermore, the CUDA runtime manages the consistency of data in memory using a page-fault mechanism. However, since pre-Pascal GPU architectures lack page-fault capability, an invocation of a GPU kernel will always trigger a full migration of data from host memory to device memory. Since the introduction of the Pascal architecture, page-faulting capability has been supported. This means that the CUDA runtime can perform on-demand migration of data between the memory systems in both directions. Data migration can be triggered by heuristics and hardware counters implemented in the runtime. Furthermore, the support for on-demand data migration enables oversubscription of memory. Since iPIC3D is a memory-intensive application, the usually limited device memory poses a serious limitation on the size of the simulation. The use of UVM allows large-scale simulations to be executed on GPUs. Further information on UVM can be found in [20].

One issue with the use of unified memory is its impact on performance, particularly when memory is oversubscribed. When a GPU kernel accesses a piece of data that is not resident in device memory, execution on the GPU has to pause and the CUDA runtime needs to perform expensive page-fault resolution before execution can be resumed. For this reason, CUDA provides a set of performance hints that indicate the access pattern of the data to allow better management by the runtime (a minimal usage sketch follows the list):


1. Preferred Location. Suggests a preferred location where the data should be resident. In case of a page fault, the runtime will resist migrating data away from the preferred location.

2. Accessed By. Establishes a direct mapping of the data for the system named in the hint, so that accesses from that system do not trigger page faults and data migration.

3. Read Mostly. Suggests that the data is mostly read and seldom written, so that the host and the devices can read the data simultaneously without faults.
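A minimal host-side sketch of unified memory with these hints, using the CUDA runtime API (cudaMallocManaged, cudaMemAdvise, cudaMemPrefetchAsync), is shown below; the allocation size and the choice of hints are illustrative and not taken from the iPIC3D port.

#include <cstddef>
#include <cuda_runtime.h>

void allocate_fields(std::size_t n_cells) {
    const std::size_t bytes = n_cells * sizeof(double);
    int device = 0;
    cudaGetDevice(&device);

    double* field = nullptr;
    // Single pointer valid on both host and device; migration is managed
    // by the CUDA runtime via page faults.
    cudaMallocManaged(reinterpret_cast<void**>(&field), bytes);

    // Preferred Location: keep the field data resident on the GPU.
    cudaMemAdvise(field, bytes, cudaMemAdviseSetPreferredLocation, device);
    // Accessed By: map it for the host to avoid faults on occasional CPU reads.
    cudaMemAdvise(field, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
    // Read Mostly would suit rarely written lookup tables instead, e.g.:
    // cudaMemAdvise(table, table_bytes, cudaMemAdviseSetReadMostly, device);

    // Optionally migrate the data up front instead of relying on page faults.
    cudaMemPrefetchAsync(field, bytes, device, 0 /* default stream */);

    // ... launch kernels that read/write `field`, then use it on the host ...

    cudaFree(field);
}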

The initial porting to NVIDIA UVM has a number of advantages. Firstly, the combination of host and GPU device memory itself resembles a heterogeneous memory system in which the migration of data between memories is managed by a runtime. The porting therefore allows the investigation of the challenges involved in the future porting to the EPiGRAM-HS environment. Secondly, the interface of CUDA UVM resembles the memory interface proposed in WP3. For example, the WP3 interface is expected to carry advice to the runtime system for optimizing data migration between memory systems. The porting can be tested by running large simulations.

6 Computational Fluid Dynamics: Nek5000

6.1 Introduction

Nek5000 is mainly written in Fortran 77, with communication (and some I/O) routines implemented in C as a separate library (GS) using MPI for message passing. The code is designed as a monolithic solver with static memory, (re)compiled for each use case, with case-dependent parameters (array sizes). Nek uses a matrix-free formulation where almost all operations involve the fast evaluation of small tensor products (level 3 BLAS) on each element, with continuity between elements ensured by (local and global) gather-scatter operations formulated as matrix products with Boolean matrices. Porting Nek5000 to large-scale heterogeneous systems poses several challenges: firstly, the lack of dynamic memory allocation makes it difficult to efficiently utilize systems with deep memory hierarchies; secondly, since everything is based upon small tensor products, these kernels must be able to run efficiently on a given target architecture.

6.2 Porting and Testing Strategy

Within the EPiGRAM-HS project we will refactor Nek5000 to be able to efficiently utilize heterogeneous architectures. Our main focus will be on accelerator-based systems, in particular GPUs. To achieve this, our first priority is to evaluate how to utilize such a system from a memory perspective.

As mentioned in the introduction, Nek5000 only uses static memory, which limits the ability to control in a fine-grained way where memory is allocated, e.g. on a GPU or on the host. To address this issue, our plan is to refactor Nek5000 to use dynamic memory allocation, which will enable us to use the memory abstraction developed in Work Package 3.


The testing will be performed on normal CPUs, for the full code once refactored, for a set of well-known test cases. Once validated, we will start testing on GPUs and also start using the memory abstraction from WP3.

Secondly, we will investigate how best to implement the gather-scatter kernels on a heterogeneous system. Our focus here will be to investigate the feasibility of using the GPU-enabled MPI developed as part of Work Package 2, as well as the new persistent collective communication from the same work package. This work will be of a more experimental nature; thus it will first be implemented in Nek5000's mini-application Nekbone, which has a low threshold for testing and evaluating new code. Most of this work will be focused on adapting the gather-scatter library (GS). If successful, these developments will be moved over to the full Nek5000, which shares a large portion of the same GS library with Nekbone.

To best utilize a heterogeneous system we will use the knowledge gained from previous work on using OpenACC with Nek5000 [21], from which we have a clear view of which kernels in Nek5000 need to be modified to run efficiently i) on a GPU and ii) in a heterogeneous setting. Of particular interest are the various pressure preconditioners in Nek, which run on a much coarser problem than the rest of the flow field. These preconditioners have been one of the bottlenecks preventing an efficient OpenACC implementation. Given that some of them are based upon Algebraic Multigrid methods using sparse matrices, we will investigate in EPiGRAM-HS the feasibility of running parts of these preconditioners on the CPU, overlapped with the solvers running on the GPUs.

For the GPU implementation, the choice of language depends on two factors: firstly, how well it interoperates with Fortran 77, and secondly, the interface with the MPI for GPUs developed in WP2 and how it interacts with MPI code still running on the CPUs. We anticipate that all these new developments will make it possible to use Nek5000 on GPUs in an efficient way. The previous work on OpenACC [21] will also serve as a baseline against which to compare our new developments.

6.3 Initial Porting

Our initial work on porting Nek5000 to new architectures has focused on introducing dynamic memory allocation to the code. In Nek5000, all variables such as flow fields are stored in several different common blocks whose sizes are defined at compile time. Also, there is currently no mechanism in the build system to avoid allocating arrays that are not used for a given use case. Therefore, our initial work has been to move away from Fortran 77 and to split up the common blocks into Fortran modules. These modules will contain Fortran derived types with allocatable memory for their own data. For example, instead of storing a set of arrays representing mesh data and different flow fields in various common blocks, we define a flow field type (field_t) on a mesh (mesh_t) with a given function space (space_t), as illustrated in Figure 8. A solver can then define as many fields as necessary for a given case, at runtime.
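
A minimal sketch of what such a module could look like is shown below; the component and procedure names are illustrative and do not necessarily match the refactored code, which is only outlined in Figure 8.

    ! Illustrative sketch of the derived-type based memory layout described
    ! above; names and components are hypothetical.
    module field
      implicit none

      type :: space_t
         integer :: lx = 0                      ! points per direction
         real(8), allocatable :: wx(:)          ! quadrature weights
         real(8), allocatable :: dx(:,:)        ! 1D derivative matrix
      end type space_t

      type :: mesh_t
         integer :: nelv = 0                    ! number of local elements
         real(8), allocatable :: x(:,:,:,:)     ! element coordinates
      end type mesh_t

      type :: field_t
         type(space_t), pointer :: Xh  => null()  ! function space
         type(mesh_t),  pointer :: msh => null()  ! underlying mesh
         real(8), allocatable   :: u(:,:,:,:)     ! degrees of freedom
       contains
         procedure :: init => field_init
      end type field_t

    contains

      subroutine field_init(this, msh, Xh)
        class(field_t), intent(inout)     :: this
        type(mesh_t),  target, intent(in) :: msh
        type(space_t), target, intent(in) :: Xh
        this%msh => msh
        this%Xh  => Xh
        ! Memory is allocated at runtime, per field and per case, instead
        ! of in compile-time sized common blocks.
        allocate(this%u(Xh%lx, Xh%lx, Xh%lx, msh%nelv))
        this%u = 0.0d0
      end subroutine field_init

    end module field

A solver could then declare, for instance, type(field_t) :: u, v, w, pr and call u%init(msh, Xh) for each field it needs at runtime; these allocation points are also where the WP3 memory abstraction can later be plugged in.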

Since our aim is not to rewrite the entire code base, we are keeping most of the kernels in Nek5000 intact, with only slight modifications so that they use the new derived types instead of including various common blocks. The refactored Nek5000 is still at a prototype stage: the new memory model is in place, but very few of the kernels necessary to run real applications have been ported. The current plan is to have an initial version with dynamic memory allocation ready in PM12, after which the communication-related work could commence.

Figure 8: An illustration of the new dynamic field type (field_t) in the refactored Nek5000.

7 Summary and Future work

This deliverable gives an overview of the initial porting of the applications and the evaluation of the EPiGRAM-HS programming environment. The major findings are summarized for each application individually.

Artificial Intelligence: Deep-Learning Applications The two DL applications (lung cancer detection and malware detection) chosen in the EPiGRAM-HS context are described in detail. They will be the baseline for the comparison of the EPiGRAM-HS approaches to deep learning in WP4 and WP5. Within the scope of WP5, GPI-ONNX will be developed to provide a framework for the training of deep neural networks distributed over many compute entities. The framework will be based on GPI. In order to test GPI-ONNX, the different parts of the implementation will be tested in isolation before the modules are put together. The preferred hardware target is a GPU cluster. Overall, the deep learning applications will be evaluated with GPI-ONNX and TensorFlow in WP5, and with the deep learning framework built on top of GPI-Space and the domain-specific language for artificial neural networks in WP4.

Weather Forecast: IFS ESCAPE dwarfs IFS will evaluate persistent MPI collectives in the spherical transform dwarf, which heavily uses FFTs and DGEMM, including global transformations. Due to the repetitive nature of the transformations, persistent collectives are a good match. The communication is currently done by a single thread of a multi-threaded application. Finepoint or endpoint MPI semantics will allow thread-safe communication from all threads. In the semi-Lagrangian advection scheme, the communication overheads of the two-sided point-to-point calls are large. In order to reduce these overheads, the single-sided asynchronous GPI programming model is considered an alternative to MPI and will be evaluated within the scope of EPiGRAM-HS. The physics dwarf is computationally expensive and handles a lot of data, so efficient access to data memory is key; the memory abstraction framework of WP3 will be evaluated. All testing will be run in a bit-identical fashion.

Space Weather: iPIC3D A performance analysis has shown that the allreduce collective currently accounts for a considerable share of the overall time spent in MPI. Therefore, MPI persistent collectives will be evaluated with iPIC3D. The particle mover of the particle-in-cell code is computationally expensive and is thus a candidate for being ported to GPUs. Substantial improvement can already be seen with simple porting techniques. The next step will be to use GPU-enabled MPI. CUDA unified memory, in the form of Unified Virtual Memory (UVM), is to be tested in preparation for an evaluation of EPiGRAM-HS's memory abstraction layer. The testing will include profiling the performance of the application, using the CPU version as a performance baseline, and running large simulations.

Computational Fluid Dynamics: Nek5000 The focus of the work on Nek5000 will be on accelerator-based systems. Of particular interest are the pressure preconditioners based on algebraic multigrid; these will be split up so that the preconditioners run on CPUs while the solvers run on GPUs. As a first step, Nek5000, which currently uses static memory that limits fine-grained control of where memory is located, will be refactored to use dynamic memory allocation. The goal is to be able to use the memory abstraction framework of WP3. Once this change has been validated, testing on GPUs will be started. Concerning the gather-scatter kernel of Nek5000, the use of GPU-enabled MPI and persistent collectives will be investigated in the Nekbone mini-application. Previous work on OpenACC will serve as a baseline against which to compare the developments.


References

[1] American Cancer Society. Cancer Facts & Figures 2016, 2016.

[2] National Lung Screening Trial Research Team. Reduced lung-cancer mortality with low-dose computed tomographic screening. New England Journal of Medicine, 365(5):395–409, 2011.

[3] Arnaud Arindra Adiyoso Setio, Alberto Traverso, Thomas De Bel, Moira SN Berens, Cas van den Bogaard, Piergiorgio Cerello, Hao Chen, Qi Dou, Maria Evelina Fantacci, Bram Geurts, et al. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge. Medical Image Analysis, 42:1–13, 2017.

[4] Samuel G Armato III, Geoffrey McLennan, Luc Bidaut, Michael F McNitt-Gray, Charles R Meyer, Anthony P Reeves, Binsheng Zhao, Denise R Aberle, Claudia I Henschke, Eric A Hoffman, et al. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans. Medical Physics, 38(2):915–931, 2011.

[5] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[6] Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115, 2017.

[7] Royi Ronen, Marian Radu, Corina Feuerstein, Elad Yom-Tov, and Mansour Ahmadi. Microsoft malware classification challenge. CoRR, abs/1802.10135, 2018.

[8] L. Nataraj, S. Karthikeyan, G. Jacob, and B. S. Manjunath. Malware images: Visualization and automatic classification. In Proceedings of the 8th International Symposium on Visualization for Cyber Security, VizSec '11, pages 4:1–4:7, New York, NY, USA, 2011. ACM.

[9] Martin Kuehn, Janis Keuper (Fehr), and Franz-Josef Pfreundt. Using GPI-2 for distributed memory parallelization of the Caffe toolbox to speed up deep neural network training, May 2017.

[10] Andreas Müller, Willem Deconinck, Christian Kühnlein, Gianmarco Mengaldo, Michael Lange, Nils Wedi, Peter Bauer, Piotr Smolarkiewicz, Michail Diamantakis, Sarah-Jane Lock, Mats Hamrud, Sami Saarinen, George Mozdzynski, Daniel Thiemert, Michael Glinton, Pierre Bénard, Fabrice Voitus, Charles Colavolpe, Philippe Marguinaud, and Nick New. The ESCAPE project: Energy-efficient scalable algorithms for weather prediction at exascale. Geoscientific Model Development Discussions, pages 1–50, January 2019.


[11] George Mozdzynski, Mats Hamrud, and Nils Wedi. A partitioned global address space implementation of the European Centre for Medium Range Weather Forecasts Integrated Forecasting System. The International Journal of High Performance Computing Applications, 29(3):261–273, 2015.

[12] Stefano Markidis, Giovanni Lapenta, and Rizwan-uddin. Multi-scale simulations of plasma with iPIC3D. Mathematics and Computers in Simulation, 80(7):1509–1519, 2010.

[13] Ivy Bo Peng, Juris Vencels, Giovanni Lapenta, Andrey Divin, Andris Vaivads, Erwin Laure, and Stefano Markidis. Energetic particles in magnetotail reconnection. Journal of Plasma Physics, 81(2), 2015.

[14] Vyacheslav Olshevsky, Giovanni Lapenta, and Stefano Markidis. Energetics of kinetic reconnection in a three-dimensional null-point cluster. Physical Review Letters, 111(4):045002, 2013.

[15] Ivy Bo Peng, Stefano Markidis, Andris Vaivads, Juris Vencels, Jorge Amaya, Andrey Divin, Erwin Laure, and Giovanni Lapenta. The formation of a magnetosphere with implicit particle-in-cell simulations. In 15th Annual International Conference on Computational Science (ICCS), June 1–3, 2015, Reykjavik University, Reykjavik, Iceland, pages 1178–1187, 2015.

[16] Ivy Bo Peng, Stefano Markidis, Erwin Laure, Andreas Johlander, Andris Vaivads, Yuri Khotyaintsev, Pierre Henri, and Giovanni Lapenta. Kinetic structures of quasi-perpendicular shocks in global particle-in-cell simulations. Physics of Plasmas, 22(9):092109, 2015.

[17] Yuxi Chen, Gabor Toth, Paul Cassak, Xianzhe Jia, Tamas I Gombosi, James A Slavin, Stefano Markidis, Ivy Bo Peng, Vania K Jordanova, and Michael G Henderson. Global three-dimensional simulation of Earth's dayside reconnection using a two-way coupled magnetohydrodynamics with embedded particle-in-cell model: Initial results. Journal of Geophysical Research: Space Physics, 122(10):10–318, 2017.

[18] Stefano Markidis, Ivy Bo Peng, Jesper Larsson Träff, Antoine Rougier, Valeria Bartsch, Rui Machado, Mirko Rahn, Alistair Hart, Daniel Holmes, Mark Bull, et al. The EPiGRAM project: preparing parallel programming models for exascale. In International Conference on High Performance Computing, pages 56–68. Springer, 2016.

[19] Chaitanya Prasad Sishtla, Steven WD Chien, Vyacheslav Olshevsky, Erwin Laure, and Stefano Markidis. Multi-GPU acceleration of the iPIC3D implicit particle-in-cell code. arXiv preprint arXiv:1904.03684, 2019.

[20] NVIDIA. Tesla P100 white paper. NVIDIA Corporation, 2016.

[21] J. Gong, S. Markidis, E. Laure, M. Otten, P. Fischer, and M. Min. Nekbone performance on GPUs with OpenACC and CUDA Fortran implementations. The Journal of Supercomputing, 72(11):4160–4180, 2015.