Accelerating the microphysics model CASIM using OpenACC

Alexandr Nigay

August 19, 2016

MSc in High Performance Computing

The University of Edinburgh

Year of Presentation: 2016


Abstract

Cloud microphysics is the class of weather modelling algorithms responsible for simulating precipitation. This project ported CASIM, a cloud microphysics model, to GPUs using OpenACC, a directive-based accelerator programming technology. In this project, CASIM runs as a plug-in of MONC, a high-resolution cloud model. After CASIM was ported to OpenACC, the performance of the CASIM+MONC hybrid improved while the performance of CASIM itself stayed approximately the same. A survey of OpenACC's maturity was performed in the process. Several compiler bugs in Cray's implementation of OpenACC were discovered and a suggestion was made for an extension of the OpenACC specification.


Contents

1 Introduction

2 Background
  2.1 Weather modelling
    2.1.1 Atmospheric flows and Large-Eddy Simulation
    2.1.2 Microphysics
    2.1.3 MONC
    2.1.4 CASIM
  2.2 Accelerators
    2.2.1 Graphics Processing Units
    2.2.2 NVIDIA Tesla K20X
    2.2.3 OpenACC
    2.2.4 Existing OpenACC-enabled Weather Models
  2.3 Background summary

3 OpenACC port of CASIM
  3.1 Methodology
  3.2 Stages of the port
    3.2.1 Determining the scope of the port
    3.2.2 Turning CASIM into an entire-type component of MONC
    3.2.3 Creating an accelerator region
    3.2.4 Choosing level of OpenACC parallelism for the loop
    3.2.5 Enabling accelerator routines
    3.2.6 Module variables in accelerator routines
    3.2.7 Minimising memory transfers
    3.2.8 Asynchronous execution
  3.3 OpenACC limitations and associated workarounds
    3.3.1 private(allocatable) not supported
    3.3.2 Allocate/deallocate statements in OpenACC code
    3.3.3 Allocatable and pointer members of derived types not supported
    3.3.4 Print statements not supported in OpenACC code
    3.3.5 Discussion of the encountered OpenACC limitations
  3.4 Compiler bugs and associated workarounds
    3.4.1 Errors when optimising conditional statements in certain routines
    3.4.2 Errors when passing a certain array to an accelerator routine
    3.4.3 "Large arguments not supported"
    3.4.4 Discussion of the encountered compiler bugs
  3.5 Summary of the OpenACC port of CASIM

4 Tuning for the specific hardware
  4.1 Approach
  4.2 Cost of memory transfers
  4.3 Warp divergence
  4.4 Improving theoretical occupancy
  4.5 Tuning summary

5 Results and Evaluation
  5.1 Evaluation of the OpenACC-ready CASIM
  5.2 Performance evaluation
    5.2.1 Performance of CASIM
    5.2.2 Performance of the CASIM+MONC hybrid
    5.2.3 Co-execution of multiple kernels on a single GPU
    5.2.4 Summary of the performance evaluation
  5.3 Maturity of OpenACC
    5.3.1 Maturity of the OpenACC specification
    5.3.2 Maturity of the Cray implementation of OpenACC

6 Conclusion
  6.1 Future Work

A List of the source code files changed by this project
  A.1 Files modified in CASIM
  A.2 Files modified in MONC


List of Figures

2.1 Figure from the Met Office technical paper [1] which compared the performance of CASIM and the Unified Model's standard microphysics scheme. The timing data for these plots was obtained by running the Unified Model on the data of the weather observation experiment COPE (case of 3 August 2013) [13]. x-axis – simulated model time corresponding to the timeline of the COPE case. y-axis – wallclock running time of the UM. "3D all/no" and "3D Cld frac" are the timings of the standard microphysics scheme. When CASIM is enabled in the UM, the running time increases by up to six times.

3.1 CASIM's memory transfer timeline before removing unnecessary transactions. Note that end data deallocates all variables specified in data's clauses.

3.2 CASIM's memory transfer timeline after removing unnecessary transactions. Constants are now transferred once and kept on the accelerator.

3.3 Mapping of CASIM's OpenACC directives to host-accelerator events in the case of synchronous execution.

3.4 Mapping of CASIM's OpenACC directives to host-accelerator events after implementing asynchronous execution and memory transfers.

5.1 Running time measurements for CASIM's accelerated and CPU-only versions. Data for the accelerated version includes both the computation time on the GPU and the time spent on host-GPU memory transfers.

5.2 Simplified demonstration of the increase in kernel execution time when launching more threads than the GPU can execute at once.

5.3 Additional timings captured for the accelerated version of CASIM+MONC: Setup time, Overlap time, Waiting time, Post-Wait time.

5.4 MONC running time versus the number of grid columns.

5.5 Time spent in each of the three host-side CASIM-related tasks versus the column count. A sharp increase in the waiting time indicates an increase in the kernel's running time on the GPU.

5.6 CASIM-related time versus the number of grid columns for the accelerated and the CPU-only versions. For the accelerated version, the time reported is the time the CPU spends on CASIM-related host-side activities; it does not include the execution time of CASIM on the accelerator.


5.7 MONC execution time versus the number of CASIM kernels running on one GPU for a grid of 6400 columns divided over 8 processes. Two kernels can be executed on the GPU concurrently without significant performance penalty.


Listings

3.1 Multiple hotspots in CASIM's code

3.2 Refactoring required to offload a hotspot nested inside a loop

3.3 Wrapping CASIM's main loop into OpenACC parallel region with nested loop construct.

3.4 Adding collapse(2) clause to the loop construct.

3.5 The effect of adding collapse(2) clause to a loop construct.

3.6 Adding "gang worker vector" clauses to the loop construct.

3.7 Example of accelerator routine declaration and usage.

3.8 Example of a module variable referenced from an accelerator routine in CASIM.

3.9 Module variable made available to accelerator routine using declare link directive.

3.10 Copying constants only once per simulation instead of on each timestep.

3.11 enter data and exit data directives inside CASIM.

3.12 Derived type with an allocatable member used in CASIM.

3.13 Allocatable member array turned into statically allocated array.

3.14 Avoiding use of pointer members of derived types inside accelerator code.

3.15 Related conditional statements inside the CASIM's accelerator routine racw that trigger a compiler bug at the optimisation level -O3.

3.16 Error message reported by the PTX assembler caused by the compiler bug encountered when compiling the accelerator routine racw.

3.17 Workaround for the compiler bug of the incorrect code generation for the conditional statements inside the accelerator routine racw. The two offending conditional statements were fused into one.

3.18 PTX assembler error signalled when compiling code that calls one of the following three CASIM's routines: sum_procs, ensure_positive_aerosol, sum_aprocs. Text of the error edited for enhanced readability.

3.19 Type signature of the procedure argument that triggers a compiler bug when calling sum_procs, ensure_positive_aerosol, sum_aprocs.

3.20 Workaround for the array passing bug. The offending argument array is wrapped into a derived type.


List of Tables

5.1 Launch configurations used for investigating the number of kernels that can share the same GPU without performance penalty.


Acknowledgements

I am grateful to my supervisor, Dr Nick Brown, for his full engagement with the project and for guidance and advice which went beyond the project's boundaries.

I would like to express my gratitude to the Swiss National Supercomputing Centre for kindly providing time on Piz Daint, without which this work would not have been possible.

I would like to express my gratitude to Ben Shipway and Adrian Hill at the Met Office for assistance with the questions about MONC's configuration and correctness checking.

I am thankful to my parents for passing on the programmer genes to me. To my aunts for supporting me. To the gloomy-in-winter and bright-in-summer Edinburgh for being home away from home. To my new friends for making home-away-from-home feel very much like home.


Chapter 1

Introduction

Weather modelling fulfils multiple roles in modern society: it produces weather forecasts for the week ahead and predictions of the planet's future climate, which inform the pursuit of sustainable human activity on Earth.

Weather models are computationally intensive scientific programs which are among the top users of HPC machines. As with any other HPC code, an increase in performance enables enhancements in the scientific output of the weather model; it translates to increased accuracy of forecasts and climate predictions.

One of the two goals of this project is to improve the performance of CASIM, a weather science code which implements cloud microphysics, the subclass of weather algorithms that models the life-cycle of rain, snow and other precipitation in atmospheric clouds. In its current state, CASIM is very slow – when coupled with the Met Office Unified Model (UM), the UM's execution time increases by a factor of six [1]. In this project, CASIM was coupled with MONC, a high-resolution cloud model. In this case, CASIM accounts for 50% of the overall MONC execution time [2].

Given the stagnation of CPU performance improvements and the data-parallel nature of most weather algorithms, weather models are increasingly ported to accelerators, mostly to Graphics Processing Units. In order to simplify programming for accelerators, OpenACC, a directive-based technology, emerged. It aims to increase programmer productivity and to allow accelerators and CPUs to share the same application codebase. GPUs are among the accelerators that can be targeted by OpenACC.

This project aims to improve the performance of CASIM by porting it to GPUs using OpenACC as the accelerator programming technology.

OpenACC is a relatively new technology, released in 2011, which is yet to receive significant adoption in the HPC community. Several works have studied the maturity of OpenACC and found limitations [3–5]. The second goal of this project is to provide new insights into the maturity of OpenACC when used for porting a complex program, such as CASIM, as opposed to porting a loop nest. During the process of porting CASIM, the experience of using OpenACC was documented and the conclusions drawn are presented in this report.

The traditional use of OpenACC, as of any other accelerator technology, is to offload the most time-consuming loops of a program onto accelerators. In this project, CASIM was offloaded entirely, including the full complication of its control flow and procedure calls. This stressed OpenACC to its limits, which is not the case with most OpenACC-enabled programs. In this regard, this project contributes to the development of OpenACC as an accelerator programming technology by being one of the few complex codes ported to it and by uncovering OpenACC's limitations in the process.

This work will be of interest to researchers planning to use OpenACC, especially if the code to be accelerated is more complex than several loops. The work outlines the steps of porting an application to OpenACC and describes which problems to expect. Furthermore, the OpenACC-offloaded version of CASIM has enabled the speed-up of the weather model MONC, which was used as CASIM's parent model in this project, and can be used by the weather research community.

Chapter 2 presents the background information on weather modelling, cloud microphysics, CASIM, MONC, accelerators and OpenACC. Chapter 3 describes the main steps of the process of porting CASIM to OpenACC, the problems encountered and their solutions. Chapter 4 discusses the process of tuning CASIM to the particular GPU hardware used in this project. Chapter 5 presents and discusses the main outcomes of the project: the OpenACC-ready version of CASIM, its performance and the investigation of OpenACC's maturity. Chapter 6 concludes the project with a discussion of the results as a whole and avenues for future work. Appendix A presents the list of the source code files changed or added by this project. All source code was submitted along with the dissertation report.


Chapter 2

Background

The topic of this project is porting the weather science code CASIM to accelerators using OpenACC as the accelerator programming technology. This chapter will set up the background for the project.

This chapter introduces all related concepts: weather forecasting, cloud modelling, the software involved, and the accelerator technology.

2.1 Weather modelling

Weather is the phenomenon produced by our planet's atmosphere interacting with itself, the oceans, the surface, and outer space. It has a significant impact on our daily lives and business activities and it is therefore important to predict.

Contemporary weather forecasting relies on sophisticated computer simulations. The state of the atmosphere over a certain area at a particular moment in time is taken as the initial conditions. This includes measurements such as temperature and pressure taken at various points spread over the area. Then, the governing equations are repeatedly solved over this dataset to simulate its evolution into the future.

Among the most important aspects that a weather simulation must take into account are the atmospheric flows and the so-called cloud microphysics [6]. This project involves two weather models – MONC and CASIM – one of which simulates the atmospheric flows while the other implements the microphysics.

2.1.1 Atmospheric flows and Large-Eddy Simulation

Flows within the layer of atmosphere adjacent to the planet's surface play a significant role in determining weather. These flows exhibit structure in the turbulence on a very wide range of scales – from millimetres to kilometres. The explicit numerical integration of the governing equations, the Navier-Stokes equations, for such a wide range is impossible to perform in a realistic time frame on contemporary computers. Therefore, models appeared that capture enough information while being computationally simpler. Large-Eddy Simulation (LES) is one of them.

Large-Eddy Simulation resolves turbulence explicitly on large scales and approximates the effect on smaller scales. This is achieved by low-pass filtering the Navier-Stokes equations, which permits only the large-scale turbulence, and using a certain model to approximate the small-scale turbulence [7].

Weather models can be broadly divided into two categories: specialised high-resolution models, which are used for detailed study of particular weather processes, and Numerical Weather Prediction (NWP) models, which produce comprehensive weather forecasts. The Met Office Unified Model is an example of an NWP model [8]. Despite being computationally cheaper than direct integration, LES is still too computationally expensive for NWP models. Instead, it is used in high-resolution models. MONC, one of the two models used in this project, is a high-resolution cloud model.

2.1.2 Microphysics

Cloud microphysics schemes are algorithms that simulate the effects of moisture processes, which describe water droplet interactions at the millimetre scale. These interactions occur at scales below the LES grid resolution but their bulk effect influences the system at the grid level. For example, processes such as condensation and evaporation influence the temperature. Discussions of cloud microphysics can be found in [9, 10].

Clouds carry water droplets and ice crystals of different sizes. Microphysics aims at predicting the size distribution of these particles. There are two general approaches to this: bin microphysics and bulk microphysics.

Bin microphysics calculates the distribution of the particles across multiple bins corresponding to different particle sizes, thus producing a histogram. Physical interactions of these particles are then simulated, which changes their distribution among the bins. This approach is complex and bears a high computational cost.

Bulk microphysics does not put the particles into discrete bins and instead assumes that the particle size distribution follows the shape of a certain function, such as the gamma distribution. In this case, the physical interactions of the particles alter the shape of the function. Often, only several moments of the distribution function are predicted and not the function itself. These moments represent bulk properties such as the water content. This approach is computationally simpler compared to bin microphysics.
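As an illustration of this idea, a common textbook choice (not necessarily the exact parameterisation used in CASIM) is a gamma form for the particle size distribution n(D), whose moments M_k have a closed form:

    % Illustrative gamma size distribution; N_0, mu and lambda are generic
    % shape parameters, not CASIM variable names.
    n(D) = N_0 \, D^{\mu} \, e^{-\lambda D},
    \qquad
    M_k = \int_0^{\infty} D^{k} \, n(D) \, \mathrm{d}D
        = N_0 \, \frac{\Gamma(\mu + k + 1)}{\lambda^{\mu + k + 1}} .

A two-moment scheme, for instance, predicts M_0 (the number concentration) and the mass-related moment (M_3 for spherical particles) and diagnoses the remaining parameters of n(D) from them.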

The physical processes modelled by microphysics can be divided into two groups – warm and cold processes. As the name suggests, the warm microphysics operates on water in the liquid and vapour phases while the cold microphysics operates on the solid phase, ice.

CASIM, the second weather model used in this project, is an implementation of the cloud microphysics.

2.1.3 MONC

The technical tasks of this project involved two weather science programs – MONC and CASIM.

The Met Office/NERC Cloud model (MONC) is atmospheric research software which implements the Large-Eddy Simulation as one of its parts [11]. It is a Fortran 2003 code parallelised with MPI and specifically designed for ease of extensibility and high scalability on modern HPC machines.

MONC models a section of the Earth's atmosphere as a Cartesian grid of cells where each cell has a certain set of properties such as the air temperature and quantities of water in different phases (liquid, gas, solid). Simulations progress in finite timesteps.

In terms of architecture, MONC consists of a model core and numerous components. The core contains a very small amount of basic logic such as the application's entry point, the timestepping infrastructure and the global state management. The components implement all other functionality such as the atmospheric science logic itself.

A minimal MONC component is a single Fortran module that provides a set of callback subroutines for MONC to call at various stages of the simulation: initialisation, finalisation and on each timestep. A component is not confined to a single Fortran module, i.e. it can consist of several modules, but one and only one of them must provide the callback interface. At MONC's compile time, the components available in the source directory are automatically detected, compiled and linked into the application. This renders the process of adding a new MONC component very straightforward. It is only necessary to put the new component's module and its makefile into a new subdirectory of the components directory of MONC's source tree.

MONC imposes a strict policy regarding the handling of the global model state. No global variables are used. Instead, the state of the model is encapsulated into a single Fortran derived type. Upon invocation, MONC creates a single instance of this type and exposes it to the components by passing it to their callbacks via an argument. This approach is superior to using global variables because it reduces the possibility of programming errors and simplifies the process of checkpointing the application.
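As a purely illustrative sketch of this structure (the module, subroutine and type names below are assumptions for demonstration and are not MONC's actual component API), a minimal component built around such callbacks might look like:

    ! Illustrative sketch only: names and the state type are assumptions,
    ! not MONC's real component interface.
    module example_component_mod
      implicit none

      ! Stand-in for MONC's single global model state derived type.
      type :: model_state_type
        real, allocatable :: theta(:,:,:)   ! e.g. a prognostic field
        integer :: timestep = 0
      end type model_state_type

    contains

      subroutine example_initialise(state)   ! called once before timestepping
        type(model_state_type), intent(inout) :: state
      end subroutine example_initialise

      subroutine example_timestep(state)     ! called on each timestep
        type(model_state_type), intent(inout) :: state
        state%timestep = state%timestep + 1  ! operate on the shared state
      end subroutine example_timestep

      subroutine example_finalise(state)     ! called once at the end of the run
        type(model_state_type), intent(inout) :: state
      end subroutine example_finalise

    end module example_component_mod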

MONC has the ability to write and read the model state to and from a checkpoint file. This feature can be used to retrieve the calculation results from MONC and to make its execution more fault-tolerant by restarting a simulation from a checkpoint after a system failure.


Each component of MONC belongs to one of the two types: the entire-type or the column-type. The difference between these two types is the way MONC invokes the component's timestep callback. The timestep callback of an entire-type component is invoked only once per timestep and the timestep callback of a column-type component is invoked multiple times, once for each column of the Cartesian grid representing the domain. This distinction reflects the algorithmic requirements of the components. A component that requires access to the whole grid at once must be an entire-type component. In contrast, certain algorithms in atmospheric science operate within a column and do not require values from the neighbouring columns. These components are of the column-type.

MONC supports many different configuration options that affect the algorithms used and their parameters. These settings are supplied to MONC in the form of a configuration file. Among other options, this file contains information about the type – entire or column – of each component and the order in which their callbacks must be invoked relative to each other.

This project did not focus on MONC but rather used it as a driver for CASIM, the code which was augmented with OpenACC.

2.1.4 CASIM

Cloud and Aerosol Interacting Microphysics (CASIM) is a modern Fortran code, developed by the Met Office, which implements a bulk microphysics scheme [12]. Porting of CASIM to accelerators using OpenACC was the central technical task of this project.

CASIM is not a standalone code but is rather intended to be used as a module that plugs into some parent weather model. Both the Unified Model and MONC can be used in this role. In this project, MONC is used as a parent model for CASIM.

The intrinsic properties of microphysics allow each domain grid column to be processed independently. CASIM internally consists of a loop that iterates across the columns of the grid passed to it from the parent model, and each iteration of this loop processes a single column. This loop will be referred to as CASIM's main loop.

However, CASIM is declared as a column-type component of MONC, thus it is invoked separately for each grid column and is given only a single column at a time by MONC. In this case, CASIM's main loop performs a single iteration which processes the currently exposed column.

Certain other details about the internals of CASIM are worth mentioning. In its current state, CASIM is a serial code which relies on the parent model to provide parallelisation. Architecturally, CASIM consists of the main module, named micro_main, and multiple supporting modules. The main module contains the procedure invoked by the parent model. Supporting modules encapsulate the various parts of the microphysics algorithm and are called from within the main module.


Figure 2.1: Figure from the Met Office technical paper [1] which compared the performance of CASIM and the Unified Model's standard microphysics scheme. The timing data for these plots was obtained by running the Unified Model on the data of the weather observation experiment COPE (case of 3 August 2013) [13]. x-axis – simulated model time corresponding to the timeline of the COPE case. y-axis – wallclock running time of the UM. "3D all/no" and "3D Cld frac" are the timings of the standard microphysics scheme. When CASIM is enabled in the UM, the running time increases by up to six times.

CASIM is a computationally expensive code. Its performance impact on the Unified Model was evaluated in the Met Office technical paper [1]. The performance of CASIM was compared against the performance of the UM's standard microphysics scheme. CASIM was found to be between 24 and 54 times slower than the standard scheme, depending on the configuration used. The plots in Figure 2.1 present the timing results for the UM runs using CASIM and the standard scheme. When CASIM is enabled in the UM, the overall running time increases by up to a factor of 6.

CASIM's performance impact on MONC was studied as part of the Project Preparation course, as presented in the course report [2]. CASIM was executed as a component of MONC for a representative microphysics test case. The profiling revealed that CASIM accounts for 50% of the overall execution time of MONC for this test case. CASIM's performance impact on MONC and on the UM differs significantly because MONC is a high-resolution model used for detailed studies of clouds while the UM is a numerical weather prediction model which uses approximation to increase performance and comply with strict execution time limits.

Thus, CASIM has been found to be a performance problem when used with either MONC or the UM as its parent model. This project aims to solve these issues by porting CASIM to accelerators.

2.2 Accelerators

Accelerators are computer hardware components that specialise in completing a particular type of calculation faster than a general-purpose CPU. The most widespread type of accelerator is the Graphics Processing Unit (GPU), which specialises in interactive 3D graphics calculations but can also be used to perform other types of calculations. This project targeted GPUs as the accelerators to improve CASIM's performance.

2.2.1 Graphics Processing Units

GPUs are optimised for solving the 3D graphics rendering problem, which is a data-parallel problem involving high volumes of floating-point operations. This profile is not unique to the graphics problem; many types of scientific computations exhibit the same properties. Therefore, GPUs are used to accelerate different applications, e.g. molecular dynamics simulations. Cloud microphysics is also a data-parallel problem with respect to individual grid columns – each column can be processed independently and the same algorithm is applied to each column. Therefore, it has the potential to benefit from GPU acceleration.

2.2.2 NVIDIA Tesla K20X

The specific GPU model used by this project is NVIDIA Tesla K20X which is installed in Piz Daint, the machine of the Swiss National Supercomputing Centre [14]. This section will provide an overview of its architecture and introduce relevant terminology.

The information about the GPU was retrieved from the architecture whitepaper [15], datasheet [16], and the deviceQuery utility [17].

Tesla K20X contains 14 Streaming Multiprocessors (SMs), each of which supports up to 2048 active threads. The GPU handles threads in groups of 32 called warps. Each SM of a K20X contains only 4 warp instruction schedulers, therefore at most 4 warps, or 128 threads, can issue an instruction on each clock cycle. Thus, only at most 128 of the 2048 active threads can progress at a time. The reason for having fewer resources than required to simultaneously execute all 2048 threads stems from the technique GPUs employ to achieve higher performance than CPUs for data-parallel problems.

The performance bottleneck that GPUs try to resolve is instruction latency, i.e. the time an instruction takes to complete. Both memory access instructions and calculation instructions incur a certain latency. GPUs hide it by rapidly switching between threads. After a thread issues an instruction, it blocks to wait for the instruction to complete and cannot issue in the meantime. In order to avoid stalling the GPU during this waiting time, SMs switch away from the blocked threads to threads that are ready to issue, thus trying to progress as many threads per clock cycle as possible. Most of the threads reside in the blocked state because instruction latencies span many clock cycles. Therefore, only a relatively small proportion of threads is ready to issue on any given clock cycle. Thus, it is not necessary to provide the amount of resources needed to execute all 2048 threads at once. Instead, GPUs are designed to have multiple SMs, each of which operates in the described manner.
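To put the figures quoted above into perspective (this is simple arithmetic on those numbers, not additional hardware data):

    % At full occupancy on a K20X:
    14 \times 2048 = 28672 \ \text{active threads},
    \qquad
    14 \times 4 \times 32 = 1792 \ \text{threads issuing per cycle},
    \qquad
    \frac{28672}{1792} = 16 ,

so at full occupancy only one in sixteen resident threads can issue on any given clock cycle; the remainder are assumed to be waiting out instruction latencies.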

Threads must have no runtime inter-dependencies to allow SMs to switch between them arbitrarily at any clock cycle. Hence, GPUs are designed to solve data-parallel problems.

The rapid context switching between threads is possible because the register values for all 2048 active threads of an SM are kept resident in its register file at all times. An SM of a Tesla K20X contains 65536 32-bit registers.

The GPU's memory is separate from the CPU's memory. In the configuration used on Piz Daint, both the CPU and the GPU can directly access only their own memory space. If values from one memory are needed in the other, explicit data transfers between them must be initiated. These transfers occur over the PCIe bus which connects the devices. This bus is commonly known to be prone to becoming a performance bottleneck for GPU-enabled applications.

GPUs cannot be programmed with standard compilers; they require specialised tools such as OpenACC.

2.2.3 OpenACC

OpenACC was chosen as the accelerator programming technology for this project. The evaluation of the alternative technologies – CUDA, OpenCL, OpenMP 4.0 – and the motivation for choosing OpenACC are presented in the Project Preparation course report [2].

OpenACC is an open standard for accelerator programming [18]. It is based on directives, i.e. special comments placed in the code. The compiler and the runtime library are responsible for preparing the code for execution on the accelerator. The directives guide them in this process by highlighting the specific sections of the code to be executed on the accelerator and specifying the parameters of the execution.


The rest of this section introduces the concepts of OpenACC which are important for understanding this document.

Format of the directives

The format of OpenACC's directives resembles the format of OpenMP's directives. It is different for C/C++ and Fortran; this document will use the Fortran format. OpenACC directives start with a guard, "!$acc", and are ignored by compilers that do not support OpenACC. The guard is followed by a directive, e.g. "!$acc parallel". The directive can be followed by one or more clauses, e.g. "!$acc parallel default(none)".

Accelerator region

The concept of an accelerator region is central to OpenACC. An accelerator region surrounds the code that must be compiled for and executed on the accelerator. In this respect, this concept is similar to the parallel region concept in OpenMP. One way of creating an accelerator region is by wrapping the target section of the code with !$acc parallel and !$acc end parallel directives.
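For illustration, the following self-contained sketch (not taken from CASIM) creates an accelerator region around a simple loop; a compiler without OpenACC support simply ignores the directives and runs the loop on the host:

    program region_demo
      implicit none
      integer, parameter :: n = 1024
      integer :: i
      real :: a(n), b(n)

      a = 1.0
      ! Code between the parallel directives is compiled for and run on the
      ! accelerator; the loop directive distributes the iterations over it.
      !$acc parallel copyin(a) copyout(b)
      !$acc loop
      do i = 1, n
        b(i) = 2.0 * a(i)
      end do
      !$acc end parallel
      print *, 'b(1) =', b(1), ' b(n) =', b(n)
    end program region_demo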

Execution model

OpenACC targets an abstract accelerator model. In this model, the execution is driven by a host CPU which can access a separate accelerator device. The host starts executing first and explicitly initiates execution on the accelerator by requesting a certain accelerator region to be executed on it.

The accelerator in the OpenACC model is a parallel computing device. OpenACC exposes three distinct levels of parallelism. The code in an accelerator region is executed in parallel by multiple vector lanes. Several vector lanes are grouped into a worker. Several workers are grouped into a gang. Execution of an accelerator region starts by spawning several gangs. The number of vector lanes in a worker, the number of workers in a gang and the number of gangs that get spawned are implementation-defined but can be changed with special clauses. The actual code execution is always performed by the vector lanes; the other two levels of parallelism only form a hierarchical grouping of the multiple vector lanes.

It must be noted that the OpenACC concept of a vector lane is entirely separate from the concept of a hardware vector lane. An OpenACC vector lane may or may not map to a hardware vector lane. In particular, the control flow of OpenACC's vector lanes is allowed to diverge.

Execution of an accelerator region starts with multiple gangs active, but inside each gang only one worker is active and inside each worker only one vector lane is active. Each of the three levels of parallelism must be enabled explicitly and this can only be done when executing a loop inside the accelerator region. Such a loop must be marked with an !$acc loop directive. The directive supports three clauses for enabling the three levels of parallelism for the loop: gang, worker, and vector. The vector clause unlocks all vector lanes inside each worker. The worker clause unlocks all workers inside each gang. Even though multiple gangs are active at the start of an accelerator region, the gang clause is still necessary because by default all gangs execute the code redundantly, i.e. they perform no work-sharing among themselves. This mode of execution is called gang-redundant mode. If a loop is encountered and the gangs are in the redundant mode, each gang will execute all iterations of the loop itself. The gang clause enables work-sharing among the gangs, making the execution truly parallel.

Thus, in order to distribute iterations of a certain loop across all available vector lanes of an accelerator, the loop must be decorated with all three clauses at once – gang worker vector – to unlock all three levels of parallelism.
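As a generic fragment (not CASIM code; the arrays a, b, c and the bound n are assumed to be declared in the enclosing scope):

    !$acc parallel
    !$acc loop gang worker vector   ! share iterations across gangs, workers and vector lanes
    do i = 1, n
      c(i) = a(i) + b(i)
    end do
    !$acc end parallel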

Overlap of execution

OpenACC's execution model assumes that the host and the accelerator are separate devices. OpenACC supports asynchronous execution of the accelerator relative to the host. By default, the host blocks and waits for the accelerator to finish executing the accelerator region. Thus, only one device is active at a time. In asynchronous mode the host schedules the execution of the region and proceeds without waiting. This frees the host CPU to perform other calculations while the accelerator is busy. In this case, both devices may be active at the same time. This overlap of execution between the host and the accelerator is important for increasing the performance benefits because the host's computing resources are not wasted waiting for the accelerator.
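A generic sketch of this pattern (again assuming a, b and n are declared in the enclosing scope; the queue identifier 1 is an arbitrary choice):

    !$acc parallel loop async(1) copyin(a) copyout(b)
    do i = 1, n
      b(i) = 2.0 * a(i)
    end do
    !$acc end parallel loop
    ! The host returns here immediately and may perform other work ...
    !$acc wait(1)   ! ... then blocks until all work queued on stream 1 has completed.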

Memory model

OpenACC's memory model states that the host's and the accelerator's memory spaces may be separate but may also be shared. Thus, a valid and efficient program must not rely on either of these assumptions alone. For the purposes of evaluating the performance of an OpenACC program, the memory spaces must be considered separate, with a certain overhead required to transfer values between them. For the purposes of evaluating the correctness of the program, the memory spaces must be considered shared, e.g. a situation where both the host and the accelerator concurrently update the same variable is considered to be a data race condition. When developing an OpenACC program, the memory spaces must be considered separate, i.e. explicit transfers between them must always be requested. If the spaces are shared and a transfer between them is requested, no action is taken and no actual transfer is performed.
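For example, the fragment below requests the transfers explicitly with a data region (assuming a, b and n exist in the enclosing scope); on hardware where the two memory spaces are actually shared, the copy clauses simply become no-ops and the program remains correct:

    !$acc data copyin(a) copyout(b)   ! a copied to the accelerator, b copied back at the end
    !$acc parallel loop
    do i = 1, n
      b(i) = a(i) + 1.0
    end do
    !$acc end parallel loop
    !$acc end data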

No memory state coherence between different vector lanes is guaranteed. Thus, if two vector lanes of the accelerator update the same memory location, the result is considered undefined.


Cray’s implementation of OpenACC

OpenACC is a standard and not an implementation in itself. It relies on vendors to provide compilers that support the specification. An implementation would map the abstract execution model of OpenACC to a specific hardware platform.

This project used Cray's implementation of OpenACC. This implementation uses NVIDIA CUDA as the back-end. An OpenACC vector lane maps to a single CUDA thread. A worker maps to a warp. A gang maps to a thread block. A thread block is the CUDA term for a grouping of warps that gets assigned to a particular streaming multiprocessor for execution. A single CUDA kernel launches one or more thread blocks, each containing one or more threads.
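Under this mapping, the OpenACC launch clauses translate into a CUDA launch configuration; the numbers in the following fragment are arbitrary and purely illustrative (a, b and n are assumed to be declared in the enclosing scope):

    ! 128 gangs x 4 workers x 32 vector lanes: under the Cray mapping this is a
    ! CUDA launch of 128 thread blocks with 4 warps (128 threads) per block.
    !$acc parallel num_gangs(128) num_workers(4) vector_length(32)
    !$acc loop gang worker vector
    do i = 1, n
      b(i) = a(i)
    end do
    !$acc end parallel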

Summary

OpenACC is a directive-based technology for accelerator programming. This project used Cray's implementation of OpenACC, which has NVIDIA GPUs as the target hardware platform.

2.2.4 Existing OpenACC-enabled Weather Models

This section provides examples of OpenACC-enabled weather models.

Advection components of MONC

Angus Lepper ported an advection component of MONC to GPUs using OpenACC in a 2015 MSc in HPC dissertation [4].

It was concluded that even though OpenACC is a directive-based approach which aims to simplify the development process, adding directives alone may not be enough to port an application. Significant structural changes to the advection code were required to achieve the port.

The code did not achieve a speed-up due to an insufficient amount of computation to load the GPU and outweigh the data transfer overhead. This suggests that a certain application may not be suited to GPU acceleration because of the nature of the computations involved.

The approach of the concurrent execution of the GPU and CPU has been adopted in the current dissertation project and proved to be important for achieving the performance improvements.

That project also used Cray's implementation of OpenACC and concluded that its support for Fortran derived types is weak.


CAM-SE port

Norman et al. [3] investigated the feasibility of using OpenACC to GPU-enable a component of the Community Atmosphere Model - Spectral Element (CAM-SE).

The development process was found to be significantly simpler with OpenACC compared to CUDA Fortran, which had been used to perform the existing port. The OpenACC version of the code performed 1.5 times slower than the optimised CUDA Fortran version, which was considered a good result given the higher-level nature of OpenACC.

The Cray and PGI implementations of OpenACC were evaluated. The Cray compiler was found to be more mature but still lacking solid support for derived types and GPU's shared memory utilisation.

COSMO port

Fuhrer et al. [19] performed a port of the atmospheric model COSMO to GPUs using OpenACC.

Porting of code sections not critical in terms of performance was achieved by simply adding the OpenACC directives. On the other hand, performance-critical code sections required refactoring to expose more parallelism and get acceptable performance with OpenACC.

2.3 Background summary

This chapter introduced the domain background and some internals of MONC and CASIM, the two weather models involved in the project. The need to accelerate CASIM was substantiated and the tools required to complete this work – OpenACC and GPUs – were presented.


Chapter 3

OpenACC port of CASIM

The previous chapter provided a brief overview of the weather forecasting domain, the use of accelerators in computing, and of CASIM and MONC, the two weather codes this project worked with. The focus of the project is to improve the performance of CASIM, when used as a plug-in for MONC, by using accelerators programmed with OpenACC.

This chapter will cover in detail the process of OpenACC-enabling CASIM. First, a general overview of the approach will be given. Then, multiple sections will describe the important work items. The final sections will cover the encountered OpenACC limitations and compiler bugs along with their workarounds.

Broadly, the work in this project may be divided into two phases. First, CASIM's code was ported to OpenACC and optimised for OpenACC's abstract model of an accelerator. Then, the OpenACC-ready code was optimised for the particular accelerator hardware, the NVIDIA Tesla K20X GPU.

This chapter covers development for the abstract model; the next chapter covers optimisation for the particular accelerator.

3.1 Methodology

This section outlines the overall approach to the OpenACC porting of CASIM taken in this project. Subsequent sections will address this process in detail.

One of the project aims is evaluating the potential of CASIM for GPU acceleration. Achieving this aim requires performing the OpenACC port of the code. However, CASIM is a big and complex code programmed using language features that are generally not favoured by GPUs, such as complex control flow. Therefore, the OpenACC porting of this code was performed in portions by disabling some parts of the code while focusing on porting the others. Disabling of the code portions was done by commenting sections out or by placing return statements into procedures to abruptly stop execution of the code.


Evaluation of OpenACC's maturity, the second project aim, was performed alongside the porting process. The OpenACC-related problems and compiler bugs that emerged in the process were documented and are presented in this report.

The HPC machine used in this project was Piz Daint of the Swiss National Supercomputing Centre [14]. It is a 5272-node Cray XC30 machine in which each node is equipped with an 8-core Intel Xeon CPU and a Tesla K20X GPU. The project was carried out in the software and hardware environment of Piz Daint.

The OpenACC compiler used for this project is the Cray Fortran compiler from the Cray Compilation Environment toolchain version 8.3.12. This was the default version of the Cray compiler toolchain available on Piz Daint during the course of the project. This version of the compiler implements the OpenACC specification version 2.0. No implementation of the latest OpenACC specification, version 2.5, was available on the machine.

The correctness of the OpenACC-enabled CASIM was ensured by comparing the checkpoint files written by MONC, the parent model for CASIM used in this project. These files contain the values of the fields that are modified by CASIM. A checkpoint file was first obtained from the original, unmodified version of the code. Then, a checkpoint file was written by the OpenACC-enabled version of the code. The two files were then compared.
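As a purely illustrative sketch of the comparison step (the routine name, the tolerance and the assumption that a CASIM-modified field has already been read from each checkpoint file into an array are all hypothetical, not MONC's actual checkpoint interface):

    ! Compares one CASIM-modified field from the reference run (ref) and the
    ! OpenACC run (acc), element by element, within a relative tolerance.
    logical function fields_match(ref, acc, tol)
      implicit none
      real(kind=8), intent(in) :: ref(:,:,:), acc(:,:,:)
      real(kind=8), intent(in) :: tol
      fields_match = all(abs(ref - acc) <= tol * max(abs(ref), 1.0d-30))
    end function fields_match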

The source code for the project was version-controlled with SVN and hosted in the Met Office repository.

The upcoming sections present the process of OpenACC-enabling CASIM in detail.

3.2 Stages of the port

This section will describe the main milestones of OpenACC-enabling CASIM. Subsequent Sections 3.3 and 3.4 will cover OpenACC limitations and compiler bugs that were encountered and overcome in the process.

The main aim of the work presented in this section was the porting of CASIM onto the generic accelerator architecture exposed by OpenACC. This is in contrast to the tuning activity presented in Chapter 4, the aim of which was the optimisation of the OpenACC-enabled CASIM for the particular hardware architecture on which the code runs.

Each step is presented in a separate subsection; the subsections are arranged in chronological order. Each subsection contains a short summary of the project's state at that point in development.


3.2.1 Determining the scope of the port

State of the project before this step. Starting point of the project's technical work. The code in its original version.

The first step of accelerator-porting a code is deciding which portion of the code to port. In the case of CASIM, two options are available. The first is a partial offload, where only hotspots are executed on the accelerator while the rest of the code remains on the CPU. The second option is a full offload. It is possible to offload CASIM in full because it executes as a plug-in of a parent model, which is MONC in this project. The reasoning behind making the choice between these two options for CASIM is presented herein.

Profiling is the best source of evidence for informing this choice. As was mentioned in Section 2.1.4, the preliminary work carried out during the Project Preparation course included the profiling of the CASIM+MONC pair [2]. CASIM has been found to occupy about 50% of the total CASIM+MONC runtime. Within CASIM, multiple individual hotspots have been identified, but each hotspot's share of the runtime was small while their combined contribution was significant. This means that a performance improvement can be gained only if all hotspots are offloaded.

The partial-offload option will be considered first and the full-offload option later. Within CASIM, the multiple hotspots are contained inside the main loop, therefore CASIM's body can be abstractly presented as in Listing 3.1. Offloading a hotspot while keeping other code on the CPU in such a setting requires refactoring as shown in Listing 3.2. The original loop has to be split into three separate loops – a hotspot loop sandwiched between two loops of remaining code. The process must be repeated for each hotspot. Such refactoring would require profound changes to CASIM's code because its hotspots are located deep in the call tree and not in the immediate body of the main loop. Such substantial changes would considerably reduce the code readability and maintainability. In addition to that, this approach requires using multiple accelerator regions because there must be one per offloaded hotspot. This is inferior to using just one accelerator region for two reasons. First, each of the multiple regions generates its own host-accelerator data transfers, resulting in more transactions overall than with a single region. These data transfers are known to be performance bottlenecks for most accelerator-enabled applications and thus must be kept to a minimum. Second, having multiple accelerator regions inside CASIM would significantly complicate the overlap of execution between the accelerator and the host. Increasing this overlap is important for optimising the code for the OpenACC accelerator model, as was mentioned in Section 2.2.3. Thus, the partial-offload approach was expected to produce limited performance improvements while requiring substantial changes to the code and damaging its maintainability.

The full-offload option will now be evaluated. As was stated earlier, CASIM as a component has been found to occupy 50% of the CASIM+MONC runtime. This makes CASIM itself a hotspot from MONC's point of view.

1 Recall that in this project CASIM runs as a MONC plug-in, as was discussed in Section 2.1.4.


subroutine CASIM()
  do i = i_start, i_end
    do j = j_start, j_end
      ...
      call hotspot1()
      ...
      call hotspot2()
      ...
      call hotspot3()
      ...
    end do
  end do
end subroutine CASIM

Listing 3.1: Multiple hotspots in CASIM’s code

Its full acceleration is therefore desirable. Fortunately, the cloud microphysics problem, which is solved by CASIM, is a fully data-parallel problem and therefore fits the accelerator architecture well. In contrast to the partial-offload approach, no substantial refactoring is required to perform the full offload of CASIM since no splitting of the main loop is necessary, so the code readability will not be damaged. These features make the full-offload option superior to the partial-offload option.

All these arguments combined led to the conclusion that offload of the entire code is the best strategy for accelerator-enabling CASIM. The work entailed by implementing this approach is detailed in subsequent sections.

State of the project after this step. Code still in its original version. Decision made to offload CASIM onto the accelerator in its entirety.

3.2.2 Turning CASIM into an entire-type component of MONC

State of the project before this step. Code still in its original version. Decision made to offload CASIM onto the accelerator in its entirety. CASIM is column-type, as was the original setting.

In the configuration used in this project, CASIM runs as a component of MONC. MONC operates on a three-dimensional Cartesian grid. On each timestep each component of MONC (see Section 2.1.3 for a discussion of components) is invoked and given a chance to perform the needed calculations. Components are divided into two types – entire-type and column-type. column-type components are provided access to one column at a time. They are invoked as many times per MONC timestep as there are columns in the grid. In contrast, entire-type components are invoked only once per timestep and given the entire grid.


! Before refactoring
do i = i_start, i_end
  do j = j_start, j_end
    call before()
    call hotspot()
    call after()
  end do
end do

! After refactoring
do i = i_start, i_end
  do j = j_start, j_end
    call before()
  end do
end do
do i = i_start, i_end ! offload this to accelerator
  do j = j_start, j_end
    call hotspot()
  end do
end do
do i = i_start, i_end
  do j = j_start, j_end
    call after()
  end do
end do

Listing 3.2: Refactoring required to offload a hotspot nested inside a loop


CASIM was originally of column-type because the cloud microphysics calculation does not depend on neighbouring columns' values. However, CASIM had to be switched to the entire-type because access to the whole field at once is needed to start parallel processing of columns on the accelerator.

MONC determines the type of a component by looking at the configuration file provided upon invocation. This file has been changed accordingly to implement the change in CASIM's type. CASIM itself already had the necessary code to handle being given the whole field instead of a single column. CASIM will now be called only once per MONC timestep.

State of the project after this step. Code still in its original form. Configuration file changed to switch CASIM to entire-type. CASIM is now ready to start accepting OpenACC directives.

3.2.3 Creating an accelerator region

State of the project before this step. Code still in its original form. No OpenACC directives have been added to CASIM yet.

The next step in OpenACC-enabling CASIM is the creation of an accelerator region. An OpenACC accelerator region is a construct that spans the code that has to be executed on the accelerator, as was covered in Section 2.2.3.

The main construct in CASIM is the loop that iterates over grid columns, as has been discussed in Section 2.1.4. Because the decision has been made to perform a full offload of CASIM, this loop must be placed onto the accelerator entirely. It has been put inside a new accelerator region by wrapping it into a parallel directive with a loop construct inside, as shown in Listing 3.3.

The loop construct by default only applies to the loop that immediately follows it. In this case, only the i-loop is parallelised while the j-loop is not. However, this arrangement does not suit CASIM very well. Since CASIM treats columns independently, each iteration of the inner loop j is independent of all others, including those happening within different iterations of the enclosing loop i. Therefore, both nested loops must be parallelised to exploit the full parallelism potential. The more parallel tasks a code generates, the higher are the potential performance gains of running it on a parallel accelerator. To achieve this, the two loops have been combined into one by adding a collapse clause to the loop construct, as demonstrated in Listing 3.4. The argument "2" specifies that the following two loops must be combined. This effectively transforms the two-loop nest into an equivalent single loop over the combined i-j iteration space, as illustrated by Listing 3.5. The loop construct now parallelises both the i-loop and the j-loop of CASIM.

CASIM's loop contains procedure calls. Procedure calls in accelerator code require special treatment according to the OpenACC specification. At this stage of the project, these calls were temporarily commented out. They will be treated in a later stage.


subroutine CASIM()
  !$acc parallel
  !$acc loop
  do i = i_start, i_end
    do j = j_start, j_end
      call microphysics(i,j)
    end do
  end do
  !$acc end loop
  !$acc end parallel
end subroutine CASIM

Listing 3.3: Wrapping CASIM's main loop into an OpenACC parallel region with a nested loop construct.

subroutine CASIM()
  !$acc parallel
  !$acc loop collapse(2)
  do i = i_start, i_end
    do j = j_start, j_end
      call microphysics(i,j)
    end do
  end do
  !$acc end loop
  !$acc end parallel
end subroutine CASIM

Listing 3.4: Adding the collapse(2) clause to the loop construct.


! code that uses the "collapse" clause
!$acc loop collapse(2)
do i = 1, i_max
  do j = 1, j_max
    call work(i, j)
  end do
end do
!$acc end loop

! the way it will be executed on accelerator
!$acc loop
do k = 1, i_max*j_max
  i = infer_i_from_k(k)
  j = infer_j_from_k(k)
  call work(i, j)
end do
!$acc end loop

Listing 3.5: The effect of adding the collapse(2) clause to a loop construct.

State of the project after this step. CASIM's main loop is wrapped into an accelerator region and executes on the accelerator. Subroutine calls within the loop are commented out.

3.2.4 Choosing level of OpenACC parallelism for the loop

The previous section mentioned collapsing CASIM's two-loop nest into a single loop. This generates a number of parallel tasks equal to the product of both original loops' trip counts. However, collapsing alone is not enough to distribute these tasks across the accelerator's parallel processing elements. For this to happen, the level of OpenACC parallelism of the loop construct must be specified. These levels were discussed in Section 2.2.3. Each iteration of CASIM's main loop is independent of all others. Therefore, it is possible and desirable to execute them all in parallel at the same time. This can be achieved by mapping each iteration of the loop to a single OpenACC vector lane by unlocking all three levels of OpenACC parallelism. This is achieved by adding the triplet of gang worker vector clauses onto the loop construct that wraps CASIM's main loop, as shown in Listing 3.6. All iterations of the combined i-j loop are now distributed across all available accelerator resources.

A loop construct at the vector level of parallelism cannot have further levels of parallelism inside its iterations. But this is fine because there is no potential for further parallelism inside each iteration: the only type of loop used inside is the k-loop, whose iterations are not independent.
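To illustrate why such a loop cannot be parallelised further, consider the following minimal sketch. It is not actual CASIM code; the names flux and rate are illustrative, and the dependence (each level accumulating the value of the level below it) is only an example of the kind of dependence that forces the k-loop to run sequentially within one vector lane.

! Illustrative only: a k-loop with a loop-carried dependence, so its
! iterations cannot execute in parallel and must run sequentially
do k = 2, nz
  flux(k) = flux(k-1) + rate(k)
end do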

Listing 3.6 presents the final form of the loop construct used in the resulting project code.


subroutine CASIM()
  !$acc parallel
  !$acc loop collapse(2) gang worker vector
  do i = i_start, i_end
    do j = j_start, j_end
      call microphysics(i,j)
    end do
  end do
  !$acc end loop
  !$acc end parallel
end subroutine CASIM

Listing 3.6: Adding "gang worker vector" clauses to the loop construct.


State of the project after this step. Iterations of CASIM's main loop are distributed across the entire accelerator. Subroutine calls within the loop are still commented out.

3.2.5 Enabling accelerator routines

CASIM is a highly modular code. Individual microphysical processes and supporting logic are all divided into separate Fortran modules and are invoked through subroutine calls. Thus, CASIM contains a lot of procedure calls in its body. Since this code was to be placed onto the accelerator, procedure call support in accelerator code was required.

OpenACC 2.0 provides the necessary support [18, Section 2.13], which comes in the form of the routine directive. A procedure marked with this directive gets compiled for the accelerator and can be called from accelerator regions. The directive requires a clause specifying the level of loop parallelism used from within it. For example, a routine worker directive should mark an accelerator routine that contains a loop marked with loop worker. A special seq clause means that the routine has no additional OpenACC parallelism inside. It makes the encountering thread execute the body of the routine sequentially, i.e. if there is a loop inside the routine, it will be serially executed by a single thread.
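As a generic illustration of this pairing (not CASIM code; the subroutine and its arguments are invented for the example), a worker-level routine might look as follows:

subroutine scale_column(n, a, factor)
  !$acc routine worker      ! the routine contains worker-level parallelism
  integer, intent(in) :: n
  real, intent(inout) :: a(n)
  real, intent(in) :: factor
  integer :: k

  !$acc loop worker         ! matches the level declared on the routine directive
  do k = 1, n
    a(k) = a(k) * factor
  end do
end subroutine scale_column

A routine seq version of the same procedure would drop the loop worker directive and let the single encountering thread execute the k-loop serially.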

In CASIM, all accelerator routines have been marked for sequential execution with the routine seq directive and clause. This is because the loop from which all accelerator routine calls are made is already marked for all three OpenACC levels of parallelism, as was discussed earlier, therefore no further parallelism is possible.

Every routine called from within accelerator code must be marked with the routine directive. This includes calls from within other accelerator routines. In other words, this must be done for all nodes of the code's call tree.


subroutine CASIM()
  !$acc parallel
  !$acc loop collapse(2) gang worker vector
  do i = is, ie
    do j = js, je
      ...
      call microphysics_common(i,j)
      ...
    end do
  end do
  !$acc end loop
  !$acc end parallel
end subroutine CASIM

subroutine microphysics_common(i,j)
  !$acc routine seq
  ...
end subroutine microphysics_common

Listing 3.7: Example of accelerator routine declaration and usage.

A total of 49 procedures have been turned into accelerator routines across CASIM. Listing 3.7 presents one example. Not every routine used in accelerator code had to be marked. A routine that had been inlined into accelerator code did not require a routine seq directive because its body had been inserted directly into the calling accelerator code.

For most accelerator routines in CASIM, merely adding a routine seq directive was not enough. If a routine used module variables then special treatment of those variables was needed, as described in the next section.

State of the project after this step. Subroutine calls in accelerator code are enabled but module variables referenced from within the routines require further treatment.

3.2.6 Module variables in accelerator routines

CASIM's accelerator routines extensively use global (Fortran module) variables. Listing 3.8 shows an example of this. According to the OpenACC specification, module variables used in accelerator routines must be decorated with a declare directive [18, Section 2.13.2].

The declare directive requires that variables appear in one of its clauses. Module variables can appear in one of the following clauses: create, copyin, device_resident, link. A variable can only appear in one clause, therefore a choice must be made as to which to use. All four options have been evaluated for CASIM:


! micro_main.F90
...
!$acc loop collapse(2)
do i = is, ie
  do j = js, je
    ...
    call set_passive_fields()
    ...
  end do
end do
!$acc end loop
...

! passive_fields.F90
module passive_fields
  real(wp) :: pressure(:)
contains
  subroutine set_passive_fields()
    !$acc routine seq
    ...
    pressure(:) = ...
    ...
  end subroutine set_passive_fields
end module passive_fields

Listing 3.8: Example of a module variable referenced from an accelerator routine in CASIM.


• declare device_resident(x) means that variable x will be allocated and used only on the accelerator. It will not be allocated on the host and will therefore be unusable there. This does not suit the needs of CASIM because most module variables used from accelerator code are also used from the host code, for initialisation. Therefore, this clause was not chosen.

• declare create(x) means that storage for variable x will be allocated on the accelerator but its value on the host will not be copied to the accelerator. As with the previous option, this does not suit CASIM because variables are initialised on the host and the copy is necessary.

• declare copyin(x) means that storage for variable x must be allocated on the accelerator and its value must be copied from the host to the accelerator. If x is a module variable, then the value that will be copied is the value it had at the start of the program. A module variable has an undefined value at the start of the program unless it is initialised at its declaration location, as in real :: x = 0. In CASIM, module variables are not initialised in that manner; they are initialised in the code instead. Thus, declare copyin(x) will not pick up the correct value and will instead use the uninitialised value. Therefore, this clause was not chosen for CASIM.

• declare link(x) gives complete control over x's lifetime to the programmer. Explicitly placed directives are required to allocate storage on the accelerator and to copy the values to and from the accelerator. This flexibility suits the needs of CASIM the best. Therefore, this clause has been chosen for decorating CASIM's module variables.

The chosen declare link directive must be applied to each module variable used from any accelerator routine in CASIM. As was stated before, there are 49 accelerator routines in CASIM. Almost each of them uses module variables. However, not every module variable is used in the accelerator code. The task of determining which module variables require the treatment was simplified by exploiting the compiler's error-checking functionality. For each module variable encountered inside an accelerator routine the Cray compiler generates the following message: "Unsupported OpenACC construct Global in accelerator routine without declare - <variable>". A procedure was first marked with a routine seq directive, the compilation was attempted, the list of variables mentioned in these error messages was retrieved, and each of them was decorated with a declare link.

A total of 250 module variables have been marked with declare link across the entire CASIM. An example of this may be seen in Listing 3.9.

However, marking variables with declare link is not enough. Since the clause provides complete control over the variable's data lifetime, its allocation and transfer to accelerator memory must now be done explicitly. The OpenACC data directive is the tool for achieving this. Its clauses allow specifying that a certain variable must be allocated space on the accelerator and/or get its values copied at the point in the program where the directive is placed. CASIM's accelerator region was wrapped into a data


! micro_main.F90
use passive_fields, only: pressure, set_passive_fields
...
!$acc data copyin(pressure)
!$acc parallel
!$acc loop collapse(2)
do i = is, ie
  do j = js, je
    ...
    call set_passive_fields()
    ...
  end do
end do
!$acc end loop
!$acc end parallel
!$acc end data
...

! passive_fields.F90
real(wp) :: pressure(:)
!$acc declare link(pressure)
...
subroutine set_passive_fields()
  !$acc routine seq
  ...
  pressure(:) = ...
  ...
end subroutine set_passive_fields

Listing 3.9: Module variable made available to an accelerator routine using the declare link directive.

directive as shown in Listing 3.9. The variables previously decorated with declare link have been added to the copyin clauses of the data directive. This causes both allocation and copying of the variables.

Thus, module variables may now be used from accelerator routines. Hence, the accelerator routines have now been made fully operational. CASIM's code now runs on the accelerator entirely.

The way this stage and the stage of the previous section were presented implies that first all routines were marked with routine seq and only then all module variables were made available to these routines. However, these steps were actually performed together, one routine at a time. First a certain procedure was marked with routine seq and then all module variables it used were marked with declare link and made available on the accelerator. This process was repeated until all routines had been treated.


module CASIM
...
contains

  subroutine initialise_micromain()
    ...
    ! copy constants to accelerator here
  end subroutine initialise_micromain

  ! this procedure is called once per timestep
  subroutine shipway_microphysics()
    !$acc parallel
    ! constants must be available here
    !$acc end parallel
  end subroutine shipway_microphysics

  subroutine finalise_micromain()
    ...
  end subroutine finalise_micromain

end module CASIM

Listing 3.10: Copying constants only once per simulation instead of on each timestep.

State of the project after this step. Subroutine calls are now supported completely. CASIM is fully offloaded to the accelerator.

3.2.7 Minimising memory transfers

As has been discussed in the background Section 2.2.3, OpenACC's memory model assumes that the host's and accelerator's memories are separate and therefore data must be transferred between them explicitly. These transfers have a tendency to be performance bottlenecks, as is the case with current GPUs, for example. Therefore, it is necessary to minimise the number of such transfers and their size.

In the current state of the code, the data that is transferred before each invocation of CASIM's accelerator region includes both input variable data and constant data. Recall that CASIM is invoked on each of MONC's multiple timesteps. Hence, CASIM's accelerator region will be executed multiple times and its data will be transferred multiple times during the course of the program's execution, as illustrated by Figure 3.1. Obviously, it is not necessary to transfer constant data on each invocation of the region because these values do not change, by definition. It is desirable to transfer them only once and keep them on the accelerator, available for use by all subsequently invoked accelerator regions.

Consider the outline of CASIM's body presented in Listing 3.10.


Figure 3.1: CASIM's memory transfer timeline before removing unnecessary transactions. Note that end data deallocates all variables specified in data's clauses.


Figure 3.2: CASIM's memory transfer timeline after removing unnecessary transactions. Constants are now transferred once and kept on accelerator.


The shipway_microphysics subroutine houses the accelerator region, which gets invoked on each timestep. If constants are to span all these invocations, they must be transferred during CASIM's initialisation phase, which is contained inside the initialise_micromain subroutine.

The constants will be copied to the accelerator in the initialisation subroutine but they must be available to the accelerator region which resides in another subroutine, the timestep callback shipway_microphysics. This means that this data's lifetime on the accelerator must be dynamic and not lexical like the one created by the data directive. OpenACC provides the enter data and exit data directives for this purpose. enter data starts a data region which lasts until a matching exit data is executed. Both directives can reside in different procedures of the program and can be executed at any time.

The enter data directive is now used to copy CASIM's constants to the accelerator. The role of the exit data directive is to free the accelerator's memory at the end of the simulation. The enter data directive has been placed into CASIM's initialisation subroutine and the exit data directive into the finalisation subroutine. This way, the dynamic data lifetime for the transferred constants spans all subsequent invocations of the timestep subroutine. Figure 3.2 presents the updated memory transfer timeline.

Listing 3.11 presents the outline of CASIM's body with the mentioned directives inserted. The copyin(var) clause of the enter data directive specifies that var must be allocated space on the accelerator and its value copied from host to accelerator.

A total of 213 constants referenced from CASIM's accelerator code have been moved into these copyin clauses of the enter data directive. Thus, constants are now copied only at the start of the simulation and kept available for all subsequently invoked accelerator regions.

State of the project after this step. CASIM fully executes on the accelerator. Unnecessary host-device memory transfers of constant data have been removed.

3.2.8 Asynchronous execution

Overlapping the execution of the CPU and the accelerator brings performance benefits, as was presented in the background discussion of OpenACC's accelerator model in Section 2.2.3.

Currently, CASIM exhibits no overlap of CPU and accelerator activity. Fortunately, there exists a potential for improving this. MONC, the parent model of CASIM in this project, contains several other components that do not depend on outputs of CASIM within a given timestep. Therefore, these components can be executed by the CPU while the accelerator is working on CASIM. This requires that CASIM is launched asynchronously. The asynchronous infrastructure for CASIM developed in this project


module CASIM
...
contains

  ! this procedure is called once, before any timesteps
  subroutine initialise_micromain()
    ...
    !$acc enter data copyin(some_constant)
  end subroutine initialise_micromain

  ! this procedure is called once per timestep
  subroutine shipway_microphysics()
    !$acc data copyin(input_variable)
    !$acc parallel
    ... = some_constant
    !$acc end parallel
    !$acc end data
  end subroutine shipway_microphysics

  ! this procedure is called at the end of simulation
  subroutine finalise_micromain()
    !$acc exit data
    ...
  end subroutine finalise_micromain

end module CASIM

Listing 3.11: enter data and exit data directives inside CASIM.


is based on the infrastructure for the advection components of MONC developed by Angus Lepper in a 2015 MSc dissertation [4].

The first step of making CASIM asynchronous is inspecting the current state of the code. Figure 3.3 shows how OpenACC directives map to specific events in the host-accelerator timeline in the current version of the code. After launching the accelerator region, the CPU stalls and waits for it to finish, resuming execution only after that. Two main types of operations are performed in this scenario – copying data between host and accelerator, and the execution of the accelerator region. A discussion of making these operations asynchronous follows.

CASIM's accelerator region was switched to asynchronous execution mode by adding an async clause to its parallel directive. No changes to the enclosed loop directive were necessary. Upon encountering the parallel async directive, the CPU only schedules the execution of the accelerator region but does not wait for it to finish.

The data directive, which is currently used to transfer data to and from the accelerator, does not support the async clause, i.e. it cannot be made asynchronous. Instead, the update directive must be used. This directive also serves the purpose of instructing the host to send or retrieve certain data from the accelerator.

The data directive has been removed from CASIM's code. Then, transfers from host to accelerator were implemented with the update device directive and transfers in the opposite direction with the update host directive. Both were decorated with async clauses. The update directives only copy the values of variables but they do not allocate space for them in accelerator memory. The allocation must happen by other means. The enter data directive, which has been introduced into CASIM's initialisation routine as described in Section 3.2.7, is the perfect candidate for performing these allocations. Therefore, all variables referenced in update directives have been added to the aforementioned enter data directive.

The parallel and update directives were fitted with the async(async-value) version of the async clause, which accepts a single integer argument. Operations with the same async-value get enqueued onto the same device activity queue [18, Section 2.14.1]. This means that they will be executed one after the other in the order they were enqueued. Both the update and parallel directives of CASIM were given the same argument to their async clause. This way the following order is achieved: input data transferred to the accelerator, accelerator region executed, output data transferred to the host.
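The arrangement just described can be summarised by the following minimal sketch. It is an illustration rather than the actual CASIM source; in_data, out_data and the queue identifier ACC_QUEUE are placeholder names.

integer, parameter :: ACC_QUEUE = 1

! inside the timestep callback:
!$acc update device(in_data) async(ACC_QUEUE)   ! enqueue input transfer
!$acc parallel async(ACC_QUEUE)                 ! enqueue the kernel on the same queue
!$acc loop collapse(2) gang worker vector
do i = is, ie
  do j = js, je
    call microphysics(i, j)
  end do
end do
!$acc end parallel
!$acc update host(out_data) async(ACC_QUEUE)    ! enqueue output transfer
! the CPU returns immediately and works on other MONC components

! later, inside the casim_join component:
!$acc wait(ACC_QUEUE)                           ! block until the whole queue has drained

Because all three operations share the same queue, the input transfer, the kernel and the output transfer are guaranteed to execute in that order.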

Figure 3.4 shows the state of CASIM's code after these changes and how its OpenACC directives now map to host-accelerator events.

These changes make CASIM execute asynchronously. Upon successfully enqueueing CASIM onto the accelerator, the CPU proceeds to working on other components of MONC. Eventually, the completion of CASIM must be tested and the results of its calculation returned to the parent model, MONC, to advance the simulation. OpenACC's wait directive was added to the code for this purpose. The directive was provided with the same async-value as were the update async and parallel async directives


before. This makes the host stall and wait for all operations in the queue associated with that value to complete.

In order to achieve maximum host-accelerator execution overlap, the wait directive must be issued as late as possible relative to the enqueueing of the accelerator region, to give the host as much time as possible to work on other tasks. Recall that CASIM gets called only once per MONC's timestep because it is an entire-type component. Since the intention is to overlap the execution of CASIM with other components of MONC, it is not possible to place the wait directive into the same callback that starts the accelerator region, because in that case waiting would happen right after the start of the region and no other component of MONC would get the chance to execute in the meantime. Therefore, a new MONC component has been created whose purpose is to execute the wait directive and then pass CASIM's retrieved output data to the parent model MONC. This component has been named casim_join. The original casim component and the new casim_join component have been ordered with respect to other MONC components in a way that allows the maximum host-accelerator execution overlap. This has been achieved by editing MONC's configuration file and placing casim at the head of the dynamics group2 and casim_join at the end of it.

CASIM is now executed asynchronously, enabling the overlap of activity between the CPU and the accelerator.

This section concludes with the code in its final state for this project. The following sections describe OpenACC limitations and compiler bugs encountered and solved while implementing the steps presented across the whole of Section 3.2.

3.3 OpenACC limitations and associated workarounds

Certain features of Fortran are not supported by either the OpenACC specification itself or Cray's implementation of OpenACC. This section presents such issues met while working through the steps of Section 3.2.

3.3.1 private(allocatable) not supported

Section 3.2.6 described the process of making module variables available to accelerator routines in CASIM. One additional problem had to be solved with these variables.

The original version of CASIM worked on a column-by-column basis. Based on this assumption, all module variables had been declared to hold data relevant only to a single column, the column CASIM is currently working on. This project changed CASIM to work on the entire grid at once. Therefore, the module variables must be altered to hold data for the whole field. One way of achieving this in OpenACC is adding these

2 For a description of the dynamics group see Section 2.1.3.


Figure 3.3: Mapping of CASIM's OpenACC directives to host-accelerator events in the case of synchronous execution.


Figure 3.4: Mapping of CASIM's OpenACC directives to host-accelerator events after implementing asynchronous execution and memory transfers.


type :: process_rate
  real(wp), allocatable :: source(:)
end type process_rate

Listing 3.12: Derived type with an allocatable member used in CASIM.

variables to a private clause of CASIM's loop construct, which creates a copy of each variable per accelerator thread that executes the loop [18, Section 2.7.10]. However, most of these variables are declared with the Fortran attribute allocatable. Unfortunately, the private clause does not support allocatable variables. The OpenACC 2.0 specification does not describe such a limitation, therefore it must stem from the Cray implementation of the standard.

This problem has been solved by extending the allocatable arrays by two dimensions indexed by the i and j coordinates of columns. This essentially created a separate copy of the array per column, but in a single, shared location. In total, 55 arrays across the entire CASIM have been treated in this manner. The code that uses any of these arrays has been altered to choose the correct subarray using the i,j indices.
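A minimal sketch of this workaround is shown below; the array name scratch and the extent names nz, nx, ny are illustrative and do not appear in CASIM.

! Before: a per-column module array, conceptually private to the column being processed
!   real(wp), allocatable :: scratch(:)          ! scratch(k)

! After: the array is extended by the column indices so that every column owns a slice
real(wp), allocatable :: scratch(:,:,:)          ! scratch(k, i, j)
!$acc declare link(scratch)

! allocate(scratch(nz, nx, ny)) on the host during initialisation;
! inside the collapsed i-j loop each iteration then works only on scratch(:, i, j)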

3.3.2 Allocate/deallocate statements in OpenACC code

Fortran allocate and deallocate statements cannot be called inside accelerator regions. The following Cray error is produced: "Unsupported OpenACC construct Calls - allocate/deallocate". This limitation comes from the Cray implementation since it is not mentioned in the OpenACC specification.

In CASIM, a module which is responsible for accumulating the current calculation results of microphysical processes into the output array performed allocation of temporary scratch space for use in the current timestep. This code had to be moved to the accelerator, therefore its allocate and deallocate statements had to be removed. The local allocatable array has been turned into a module variable and its allocation has been moved to the module's initialisation routine. It was then treated as any other module variable used in an accelerator routine, as was described in Section 3.2.6.
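The shape of this change is sketched below, with invented module and variable names; the real CASIM module differs.

module process_accumulation                 ! illustrative name
  implicit none
  integer, parameter :: wp = kind(1.0d0)
  real(wp), allocatable :: scratch(:,:,:)   ! formerly a local allocatable inside the routine
  !$acc declare link(scratch)
contains
  subroutine initialise_accumulation(nz, nx, ny)
    integer, intent(in) :: nz, nx, ny
    allocate(scratch(nz, nx, ny))           ! allocated once, on the host, at initialisation
  end subroutine initialise_accumulation
end module process_accumulation

The accelerator routine that previously allocated the scratch array now simply uses the module variable, which is allocated and copied to the accelerator by the same data directives that manage CASIM's other declare link variables.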

3.3.3 Allocatable and pointer members of derived types not supported

OpenACC does not support allocatable and pointer members of Fortran derived types. This topic is described as "deferred for a future revision" in both the OpenACC 2.0 [18, Section 1.9] and 2.5 specifications [20, Section 1.10]. CASIM contains one derived type with an allocatable member and one derived type with a pointer member.

One of CASIM's arrays is of a derived type with an allocatable member. The type signature of this derived type is presented in Listing 3.12. Several potential options for


integer, parameter :: MAX_N_QVARS = 17

type :: process_rate
  real(wp) :: source(MAX_N_QVARS)
end type process_rate

Listing 3.13: Allocatable member array turned into a statically allocated array.

removing the allocatable attribute have been evaluated:

• Explicitly allocate the allocatable member from within accelerator code. This requires executing an allocate statement for that variable. However, as has been described in Section 3.3.2 above, allocate statements cannot be used in accelerator code. Therefore, this option was discarded.

• Use the Cray compiler's -h accel_model=deep_copy flag. This flag makes the OpenACC runtime perform a deep copy of derived types whenever they are copied to the accelerator. This includes copying allocatable members intact and in a usable state. However, this option has been dropped because the flag is not standardised and its behaviour is not supported by the OpenACC specification. Furthermore, the previous year's MSc project found the application of this flag to be problematic when transferring certain derived types [4].

• Make the allocatable member statically allocated. It is known in advance that the size of the member will not exceed 17, therefore the array can simply be statically allocated for this size. The changes needed to implement this are presented in Listing 3.13. This option has been chosen and implemented in this project.

CASIM uses a variable of a derived type with pointer members. It has been replaced with direct usage of the variables it points to. This is illustrated by Listing 3.14.

The crux of this problem is the fact that allocatable and pointer members contain host memory addresses and, when transferred to the accelerator, these addresses become invalid because the memory spaces are distinct, at least in the case of GPUs.

3.3.4 Print statements not supported in OpenACC code

If print or write statements are encountered in OpenACC code, a runtime error is generated: Thread execution stopped at FILE <file> LINE <line> due to an I/O statement. The OpenACC specification [18] does not cover this topic, therefore this limitation must be coming from the Cray implementation of OpenACC.

CASIM used print statements for error reporting. All these statements were removed from OpenACC code for this project. The lost error reporting functionality must be


! mphys_switches.F90
type :: process_switch
  ...
  logical, pointer :: l_pracw
  ...
end type process_switch

type(process_switch) :: pswitch
logical, target :: l_pracw

pswitch%l_pracw => l_pracw

! micro_main.F90 before the change
use mphys_switches, only: pswitch
if (pswitch%l_pracw) call racw(...)

! micro_main.F90 after the change
use mphys_switches, only: l_pracw
if (l_pracw) call racw(...)

Listing 3.14: Avoiding use of pointer members of derived types inside accelerator code.

replaced with some other mechanism. For example, an error code could be put into a global variable which is transferred back to the host and examined there after the accelerator region terminates. Due to time constraints and because of the extremely rare occurrence of errors in CASIM, this system was not implemented during this project. Instead, the correctness of the program was later ensured with a separate correctness-checking step, as was described in the Methodology Section 3.1.
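A minimal sketch of such an error-code scheme (not implemented in this project; module and variable names are invented) could look as follows:

module casim_errors                      ! illustrative only
  implicit none
  integer :: casim_error_code = 0
  !$acc declare link(casim_error_code)
end module casim_errors

! inside an accelerator routine, instead of a print statement:
!   casim_error_code = SOME_ERROR_ID     ! note: concurrent writes from many
!                                        ! threads would need further care

! on the host, after waiting for the accelerator region to finish:
!   !$acc update host(casim_error_code)
!   if (casim_error_code /= 0) print *, 'CASIM reported error ', casim_error_code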

3.3.5 Discussion of the encountered OpenACC limitations

The limitations coming from the Cray implementation concern behaviour that is not explicitly documented by the specification. Therefore, the conclusion that the compiler does not comply with the specification cannot be made.

3.4 Compiler bugs and associated workarounds

The Cray compiler used in this project supports the OpenACC 2.0 standard fully. However, several compiler bugs have been encountered when using these features. The compiler version used was Cray CCE 8.3.12.

The machine this project used, Piz Daint, is equipped with NVIDIA Tesla K20X GPUs [14]. The compiler implements OpenACC using NVIDIA's general-purpose GPU programming technology CUDA. The compiler directly generates the code in PTX assembly, which is the assembly language used by the NVIDIA GPUs. All compiler bugs met in this project were occasions when the Cray compiler generated incorrect PTX code.


if (l_2mc) then
  cloud_number = qfields(k, i_nl)
else
  cloud_number = fixed_cloud_number
end if

if (...) then
  ...
  if (l_2mc) dnumber = dmass/(cloud_mass/cloud_number)
  ...
end if

Listing 3.15: Related conditional statements inside the CASIM's accelerator routine racw that trigger a compiler bug at the optimisation level -O3.

ptxas accretion_1.ptx, line 1138; error :
Unknown symbol '$racw$accretion___l7__'
Label expected for forward reference of '$racw$accretion___l7__'

Listing 3.16: Error message reported by the PTX assembler caused by the compiler bug encountered when compiling the accelerator routine racw.

Solutions to these problems have not been found in publicly accessible sources. Therefore, all workarounds were developed during this project. The approach to finding workarounds for these bugs was based on inspecting the PTX assembly generated by the Cray compiler while referring to the description of the PTX instruction set architecture [21].

3.4.1 Errors when optimising conditional statements in certain routines

Three of CASIM's accelerator routines – raut, racw, revp – triggered generation of incorrect code for the conditional statements inside them. An excerpt from the routine racw shown in Listing 3.15 demonstrates one such case. Compilation of this code generates the error presented in Listing 3.16.

The line reported in the error message corresponds to one of the conditional statements inside the routine. The compiler fails to generate a label in the assembly for its else-branch. This error only occurs at the optimisation level -O3 and therefore must be related to a certain optimisation. This conditional statement shares the logical expression – the logical variable l_2mc – with another conditional statement further in the


if (...) then
  ...
  if (l_2mc) then
    cloud_number = qfields(k, i_nl)
    dnumber = dmass/(cloud_mass/cloud_number)
  else
    cloud_number = fixed_cloud_number
  end if
  ...
end if

Listing 3.17: Workaround for the compiler bug of incorrect code generation for the conditional statements inside the accelerator routine racw. The two offending conditional statements were fused into one.

ptxas micro_main_1.ptx, line 10499; error :
Arguments mistmatch for instruction 'mov'
Unknown symbol 't$5'
Label expected for forward reference of 't$5'

Listing 3.18: PTX assembler error signalled when compiling code that calls one of the following three CASIM routines: sum_procs, ensure_positive_aerosol, sum_aprocs. Text of the error edited for enhanced readability.

body of the routine, as was shown in Listing 3.15. Thus, a hypothesis was made that the compiler attempts an optimisation that is supposed to link these two conditionals but fails. Based on this hypothesis, the refactoring presented in Listing 3.17 was attempted. The two conditional statements were fused into a single one with the intention of recreating the attempted optimisation. This change successfully resolved the compiler bug. Similar refactoring was applied to the other two routines, raut and revp.

3.4.2 Errors when passing a certain array to an accelerator routine

A bug was encountered when calling three accelerator routines of CASIM: sum_procs, ensure_positive_aerosol, sum_aprocs. When compiling the code that calls these routines, the compilation fails with the error presented in Listing 3.18. The error is returned by NVIDIA's PTX assembler, which is invoked after the Cray compiler finishes generating the PTX assembly code.

The PTX assembly file was inspected and it was discovered that the error occurs in the code that prepares a certain array argument to be passed to those routines. A common feature of these three routines is an argument which is an array of a certain derived type. The type signatures of this argument and its datatype are presented in Listing 3.19. The problem occurs only when passing this particular array; the mechanism of passing other types of arrays as arguments works without problems, and there are many


type :: process_name
  integer :: id
  integer :: unique_id
  character(20) :: name
  logical :: on
end type process_name

subroutine sum_procs(..., iprocs, ...)
  type(process_name), intent(in) :: iprocs(:)
end subroutine sum_procs

Listing 3.19: Type signature of the procedure argument that triggers a compiler bug when calling sum_procs, ensure_positive_aerosol, sum_aprocs.

type :: cray_workaround_iprocs_wrapper
  type(process_name) :: iprocs(22)
  integer :: iprocs_count
end type cray_workaround_iprocs_wrapper

subroutine sum_procs(..., iprocs, ...)
  type(cray_workaround_iprocs_wrapper), intent(in) :: iprocs
end subroutine sum_procs

Listing 3.20: Workaround for the array passing bug. The offending argument array is wrapped into a derived type.

array arguments in CASIM's other accelerator routines.

Inspection of the PTX file revealed that when an array must be passed as an argument to an accelerator routine, a data structure is used which acts as an array descriptor storing information such as the array's bounds, number of elements, pointer to the data, and other fields. Broken code is generated for the preparation of this data structure for the array presented in the aforementioned Listing 3.19.

An attempted solution was to inline the offending routines into the caller. However, this triggered another compiler error connected with the multiple definition of the module variables used inside the inlined routines. This solution was thus rejected.

The problem was solved by wrapping the offending array argument into the derived type presented in Listing 3.20. The original array is now placed into a member of the wrapper type and only a single instance of the wrapper type is passed to the routine. The routine subsequently accesses the array as a member of the wrapper derived type. The other member of the type stores the length of the wrapped array. The explicit storage of the array size is necessary because the code inside the routines uses this information, different call sites of the routines use arrays of different length, and the usage of allocatable members in derived types in OpenACC code is discouraged.


The exact nature of this bug remained unclear. The offending array is of a derived type but it does not have any allocatable or pointer members. It has a character-type member, and the support for this type is not complete in the Cray implementation of OpenACC, but the code works correctly with this member intact once the workaround described above is applied. More complex codes such as CASIM should be ported to OpenACC to help reveal similar bugs and motivate their fixing.

3.4.3 "Large arguments not supported"

As has been described before, CASIM's main loop was wrapped into an OpenACC accelerator region. Because the compiler used in this project employs NVIDIA CUDA as the OpenACC back-end, the accelerator region is implemented as a CUDA kernel function. This is the function which is subsequently executed by the GPU.

The variables referenced in the lexical scope of the accelerator region, i.e. in its immediate body and not in the procedures it calls, are passed through arguments of the CUDA kernel function which implements the region. However, when a certain threshold of size or count of such arguments is exceeded, the compiler passes these variables differently, by packing them into a contiguous buffer in memory on the accelerator and passing a pointer to the base of this buffer. The code on the GPU subsequently unpacks this buffer. This feature is referred to by the compiler as "Large arguments". This mechanism requires support from both the host side and the accelerator side. The host side of this feature is not implemented by the Cray compiler and generates a runtime error and program termination: "ACC: craylibs/libcrayacc/acc_hw_nvidia.c:915 CRAY_ACC_ERROR - Large args not supported". Neither the nature of the threshold, i.e. whether it applies to the number or the size of the arguments, nor its numerical value were mentioned in any documentation. Empirically it was found that the feature is triggered when the CUDA kernel has more than 532 arguments.

CASIM uses enough variables in the lexical scope of the accelerator region to trigger this error. Therefore, it became necessary to reduce the number of CUDA kernel function arguments used by CASIM. As a first step of solving this problem, the generated PTX assembly for CASIM's accelerator region was inspected. By varying the number of variables referenced in the region, achieved by commenting some out, and recompiling the code, it was discovered that two factors influence the number of the kernel's arguments the most:

• the number of distinct module arrays used in the lexical scope of the accelerator region

• the number of times each such array is referenced in the lexical scope of the accelerator region

These two factors are related to the arrays. Non-array variables also use the kernel's arguments, however they only occupy a single argument per variable, while the arrays use more than one and influence the total count more.


The workaround for this bug acts on the first factor by reducing the number of distinct module arrays used in the region.

Among the inputs to CASIM's accelerator region are the so-called q-fields and aerosol fields. These are the arrays holding the data pertaining to the cloud microphysics processes themselves. These are module arrays that are passed to the CUDA kernel via its arguments. In total, there are 26 distinct three-dimensional q-field arrays and 34 distinct three-dimensional aerosol field arrays. Thus, there are 60 such arrays but in total they occupy several hundred of the CUDA kernel's arguments because each uses up to 10 of them. The total count of the arrays was reduced by packing these three-dimensional arrays into four-dimensional arrays. There are four logically distinct groups among these 60 arrays, therefore they were packed into 4 distinct four-dimensional arrays. For example, the 13 q-field tendency arrays dq1(:,:,:), ..., dq13(:,:,:) were packed into a single dq_cray_workaround(:,:,:,:) array where the dq1(:,:,:) array is now accessible at dq_cray_workaround(:,:,:,1). The last index now selects one of the 13 original arrays. In Fortran, the left-most array index varies the fastest when traversing the array contiguously. Therefore, placing the array-selection index at the right-most position kept the storage corresponding to each of the original 13 arrays contiguous in memory. The same procedure has been performed on the other q-fields and aerosol fields, thus reducing the module array count from 60 to 4. This brought the CUDA kernel's argument count below the "Large arguments" threshold and prevented it from triggering the error.

The fact that this feature is not implemented completely in the Cray compiler suggests that there exist not many codes, if any, that push OpenACC to its boundaries in terms of this feature.

3.4.4 Discussion of the encountered compiler bugs

The nature of the encountered compiler bugs suggests that wider adoption of OpenACC and its wider application to a variety of complex codes will uncover more of these bugs and also motivate more comprehensive testing of the compiler.

The "Large arguments" feature is particularly important for this argument because it is not a compiler bug but rather an unimplemented feature which appears not to attract enough motivation for its realisation. This work on CASIM has thus expanded the boundaries of OpenACC's usage.

3.5 Summary of the OpenACC port of CASIM

This chapter concludes with CASIM being OpenACC-enabled and optimised for OpenACC's abstract model of an accelerator. The next chapter will cover the process of optimising CASIM for the specific accelerator device this project was using.


Chapter 4

Tuning for the specific hardware

Chapter 3 described the work of adapting CASIM to OpenACC's abstract architecture of an accelerator. The performance of the CASIM+MONC hybrid after these changes was found to be equal to that of the original version. Therefore, the next step was the tuning of the OpenACC-enabled CASIM for the particular hardware the project was using – the NVIDIA Tesla K20X GPU. This process is presented in this chapter.

4.1 Approach

As was mentioned in Section 3.4, the Cray compiler directly generates PTX assembly code for the NVIDIA GPUs. This enables the use of NVIDIA's performance tools. The tuning process was mostly driven by the analysis feature of the NVIDIA Visual Profiler [22].

4.2 Cost of memory transfers

The transfers between the host's and the accelerator's memory are prone to be a performance bottleneck for a GPU application. Therefore, this cost was the first metric to be inspected for performance issues.

In the case of CASIM, only 5% of the total kernel execution time was spent on data transfers, as reported by the Visual Profiler.

4.3 Warp divergence

Warp divergence is the situation when threads in a warp take different code paths. An NVIDIA GPU handles this situation by executing all possible code paths and deactivating the threads that do not take the particular path while activating the threads that do. This results in increased execution times and loss of performance.


CASIM is a complex code with an abundance of control flow statements. Therefore, warp divergence was expected to be a problem. However, as was confirmed by the Visual Profiler, 80% of the time the threads in each warp of CASIM's kernel take the same execution path. Thus, this was not found to be a performance issue for CASIM.

4.4 Improving theoretical occupancy

As reported by the Visual Profiler, CASIM's kernel demonstrated low theoretical occupancy. Theoretical occupancy is the measure of the maximum number of threads that can fit onto a streaming multiprocessor.

Each streaming multiprocessor of an NVIDIA GPU has a limited amount of registers and shared memory. These resources are divided among all threads executing on an SM. The amount of these resources each thread is using thus limits the maximum number of threads that can theoretically fit onto one SM. In the case of CASIM, the limiting factor was the register usage per thread.

As was mentioned in Section 2.2.2, one SM of an NVIDIA Tesla K20X GPU has 65536 registers and supports up to 2048 active threads. CASIM's kernel used 128 registers per thread, therefore only 65536/128 = 512 of CASIM's threads could fit onto one SM of a K20X. The theoretical occupancy is calculated as the ratio of the number of threads that can fit onto the SM subject to resource constraints to the maximum possible number. In this case, it was equal to 512/2048 = 0.25. A theoretical occupancy of 0.25 is considered to be low and leads to underutilisation of the GPU.

The kernel was changed to use 64 registers per thread instead of 128. This increased the number of threads that fit onto an SM to 65536/64 = 1024, thus increasing the theoretical occupancy to 1024/2048 = 0.5. A further decrease in the register usage was impossible because of the constraints imposed by the linked CUDA library.
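In summary, the occupancy figures quoted above follow directly from the register limit:

\text{theoretical occupancy} \;=\; \frac{\min\!\big(\lfloor 65536/\text{registers per thread}\rfloor,\; 2048\big)}{2048},
\qquad
\frac{65536/128}{2048} = 0.25,
\qquad
\frac{65536/64}{2048} = 0.5 .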

The register usage of the kernel was altered using the -maxrregcount 64 flag of the PTX assembler. The Cray compiler allows passing flags to the PTX assembler via its -Wx flag.

4.5 Tuning summary

Thus, the process of tuning CASIM's kernel for the Tesla K20X GPU focused on increasing its theoretical occupancy. The performance of this new version of the code will be evaluated in Section 5.2.


Chapter 5

Results and Evaluation

The previous chapters of this document covered the process of porting CASIM to OpenACC and optimising it for the particular hardware of Piz Daint, the machine this project used. This chapter discusses the main outcomes of the project. These include a discussion of the consequences of porting CASIM to OpenACC, the performance characteristics of the new version of the code, and an evaluation of OpenACC's maturity as an accelerator programming technology.

5.1 Evaluation of the OpenACC-ready CASIM

The code in its current state targets GPUs but, since it is programmed with the hardware-agnostic OpenACC, it can also be adapted to other accelerator types that are supported now or will be supported in the future by OpenACC compilers.

Currently, OpenACC is mainly supported on GPUs, however the range of supported devices is expanding. For example, the Sunway SW26010 many-core processor used in the Sunway TaihuLight system, which is currently the number-one machine in the Top500 list [23], can be programmed with OpenACC [24].

The code should run on any device that supports OpenACC, however tuning for the particular hardware will be required to achieve maximum performance. Nevertheless, the porting stage of the work is reused in this case.

5.2 Performance evaluation

This section will investigate the performance impact of the OpenACC acceleration of CASIM completed in this project.

MONC and CASIM are very flexible applications allowing many different configurations to be used. The accelerated version of CASIM may thus perform better for certain configurations and worse for others.


configurations and worse for others. Therefore, the performance comparison betweenthe CPU-only and the accelerated versions of CASIM must be carried across a varietyof different configurations.

A test case configuration supplied by the Met Office has been used for the performanceevaluation runs. The test case models the formation of a stratus cloud and the micro-physical interactions inside this cloud which describe the formation of rain, snow, andhail. This is one of the principal test cases for CASIM. For the performance evalu-ation, only the domain size parameter will be varied. This will generate a family ofconfigurations across which the performance of CASIM will be evaluated.

The domain grid is three-dimensional and hence has three size parameters – x, y, z – one for each dimension. Columns stretch along the z-axis. Recall from Section 3.2.3 that CASIM's loop that is parallelised is the column loop, i.e. the loop which iterates over the columns and each iteration of which processes a single column. Hence, the product x*y determines the number of parallel tasks while the z-size scales the amount of computational work inside each parallel task. Therefore, the effects of x and y will not be studied independently but rather collectively as a product x*y. Thus, the two parameters relevant for the performance studies are the x*y product and the z-size.
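
To make this mapping concrete, a minimal sketch of a column loop in the style described above is shown below. It is an illustration only: the subroutine name, array shapes and the trivial loop body are placeholders, not CASIM's actual code.

    ! Illustrative sketch of the column-loop parallelisation (placeholder names).
    subroutine column_loop(nx, ny, nz, fields)
      implicit none
      integer, intent(in) :: nx, ny, nz
      real, intent(inout) :: fields(nz, nx, ny)
      integer :: i, j, k

      ! x*y iterations -> number of parallel tasks (one GPU thread per column);
      ! nz only scales the serial work done inside each column.
      !$acc parallel loop collapse(2) copy(fields)
      do j = 1, ny
        do i = 1, nx
          do k = 1, nz
            fields(k, i, j) = fields(k, i, j) * 0.5   ! stand-in for the microphysics
          end do
        end do
      end do
    end subroutine column_loop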

First, the performance of CASIM's original CPU-only version and of the accelerated version will be compared. However, CASIM cannot be used as a standalone application; it requires a parent model, such as MONC or the UM, to run. This project used MONC as the parent model for CASIM. Therefore, subsequent sections will investigate the impact of CASIM's acceleration on the performance of the CASIM+MONC hybrid.

All files related to this analysis – MONC configuration files, batch scripts, raw data, data handling scripts – have been submitted along with the dissertation report.

5.2.1 Performance of CASIM

The analysis presented in this section focuses on the performance of CASIM, without considering the performance of a parent model, such as MONC or the UM.

The data for this analysis was obtained from single-core runs. Accelerated runs used one GPU. Two versions of the code were run – the baseline version, which is the version prior to any changes made by this project, and the accelerated version, which was produced at the end of the tuning process described in Chapter 4.

On each run, the execution time of CASIM was measured. Timings for the baseline version were obtained by wrapping calls to CASIM in a pair of MPI_Wtime function calls. Timings for the accelerated version were obtained using NVIDIA's nvprof profiler [25] and included both the time spent executing the code on the GPU and the time spent performing the host-GPU memory transfers.
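
For reference, a minimal sketch of the MPI_Wtime wrapping used for the baseline timings is shown below; the wrapped routine is a placeholder, not the actual MONC/CASIM interface, and an initialised MPI environment is assumed.

    ! Sketch of timing one wrapped call with MPI_Wtime (placeholder names).
    subroutine timed_call()
      use mpi
      implicit none
      double precision :: t_start, t_end

      t_start = MPI_Wtime()
      call do_work()                 ! stand-in for the wrapped CASIM call
      t_end = MPI_Wtime()
      print *, 'elapsed time (s): ', t_end - t_start

    contains
      subroutine do_work()
        ! placeholder body; in the real code this is the call into CASIM
      end subroutine do_work
    end subroutine timed_call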

Changing the z-size of the grid was found not to change the results qualitatively. The value was found to simply scale the plots without significantly changing their shapes or positions relative to each other. Hence, the conclusions drawn will be the same regardless of the z-size value. Thus, only the influence of the x*y parameter will be studied in this section while the z-size will be fixed at the value of 80.

Multiple different values of the x*y parameter were used. For each value, both versions of the code were run 10 times and the timing measurements were aggregated using the arithmetic mean. The timings were very stable and the variation was negligible; therefore, the choice of the aggregating function does not influence the conclusions.

Figure 5.1 shows the plot of CASIM's running time versus the number of grid columns, x*y. The plot has two main features:

• both the original and the accelerated versions have nearly the same running time until the column count of approximately 15000

• the running time of the accelerated CASIM increases sharply at the column count of approximately 15000

The execution times of the CPU-only and the accelerated versions of CASIM are approximately the same for column counts below roughly 15000. This suggests that the acceleration did not result in any significant speed-up of CASIM itself. The main reason for this is the frequent stalling of warps, which was discussed in Chapter 4. Further tuning of the code may be possible; however, it goes beyond the time limits of this project. The accelerated version of CASIM may also show better performance on different accelerator hardware which can better tolerate the lack of instruction-level parallelism in CASIM's code.

The sharp rise of the accelerated CASIM's running time at the point of approximately 15000 columns is caused by the properties of the particular GPU the project is using – the NVIDIA Tesla K20X. This GPU has 14 streaming multiprocessors, each supporting up to 2048 executing threads, as was discussed in Section 2.2.2. Then, the maximum number of threads that can be in the active state on this device at any given moment is 14 ∗ 2048 = 28672. As has been discussed in Section 4.4, the theoretical occupancy of CASIM is 0.5 on the Tesla K20X, which is caused by the scarcity of registers. This means that only a maximum of 28672 ∗ 0.5 = 14336 of CASIM's threads can be executed by the GPU at any given moment. The number of grid columns equals the number of threads because a column maps to an iteration of CASIM's main loop, which in turn maps to a single GPU thread. Therefore, if the number of grid columns exceeds 14336, then the number of threads will also exceed 14336 and the situation illustrated in Figure 5.2 will occur where not all threads can be active at the same time. A batch of 14336 threads is executed first and the remaining threads are executed only after that. As a result, the running time increases sharply. Indeed, the increase in CASIM's running time is observed starting from the first grid size that exceeds the barrier of 14336 columns.

The rise in the running time clearly makes the accelerated version inferior to the CPU-only version in terms of performance in the range of column counts greater than 14336. Fortunately, there are reasons to conclude that realistic use cases of CASIM+MONC will use fewer than 14336 columns.


[Plot: CASIM runtime in seconds versus the number of columns (x*y); series: Baseline CASIM time, Accelerated CASIM GPU time (including transfers).]

Figure 5.1: Running time measurements for CASIM's accelerated and CPU-only versions. Data for the accelerated version includes both the computation time on the GPU and the time spent on host-GPU memory transfers.

For example, with the z-size of 80, which was used for the runs in this analysis, the number of grid points at 14336 columns already exceeds a million, while a realistic configuration would normally use a maximum of 128 thousand grid points, as described in [11].

While limited performance improvements were achieved for some domain sizes, the accelerated CASIM did not experience any significant speed-up relative to the original version in the realistic range of parameter values. Nevertheless, it may still be beneficial to run CASIM on a GPU when it is used as a component of some parent model, because this frees the CPU to perform other computations while the GPU is working on CASIM. This possibility is further reinforced by the fact that CASIM's execution time on the GPU, inclusive of the memory transfer time, is approximately equal to the execution time of the CPU-only version. The next section will explore this using MONC as the parent model for CASIM.

5.2.2 Performance of the CASIM+MONC hybrid

In contrast to the previous section, which analysed the performance of CASIM itself, this section analyses the performance of the parent model MONC with CASIM as one of its components. The main aim of this study is to evaluate the impact of CASIM's acceleration on the parent model's running time.

This analysis uses single-core runs. The running time of the parent model, MONC, was measured and compared. Two versions of the code were used – MONC coupled with the original CPU-only version of CASIM and MONC coupled with the accelerated version of CASIM produced by this project. Timings for both versions were collected using MPI_Wtime calls, one placed at the start of MONC and the other at the end.


[Diagram: two timelines comparing a kernel launch with fewer threads than the GPU can execute at once against a launch with more threads than the GPU can execute at once; the excess threads add execution time.]

Figure 5.2: Simplified demonstration of the increase in kernel execution time when launching more threads than the GPU can execute at once.


[Diagram: CPU and GPU timelines; the CPU passes through Setup, Overlap, Wait and Post-Wait phases while the CASIM kernel executes on the GPU.]

Figure 5.3: Additional timings captured for the accelerated version of CASIM+MONC: Setup time, Overlap time, Waiting time, Post-Wait time.

Thus, the wall-clock time taken to complete the entire MONC run was measured. The measurements were made across several different grid sizes. For each grid size the program was run 10 times and the measured running time summarised using the arithmetic mean. The measured values were found to have negligible variance within each 10-run batch.

Multiple additional timings were captured for the accelerated version of CASIM+MONC, as illustrated in Figure 5.3. These timings are:

• Setup time – the time the CPU spends preparing the CASIM kernel to be launched on the GPU. This does not include the host-GPU memory transfer time because these transfers are performed asynchronously.

• Overlap time – the period of time during which both the CPU and the GPU are busy. The GPU is working on CASIM while the CPU is working on the MONC components that do not require CASIM's outputs during a given timestep.

• Waiting time – the time the CPU spends in a blocked state waiting for the CASIM kernel to finish on the GPU and return the results to the CPU.

• Post-Wait time – the time the CPU spends post-processing the outputs of the CASIM kernel. This entirely consists of copying the data from the buffers written by the GPU to the buffers used by MONC.

All these additional timings were captured by surrounding the relevant parts of the code with MPI_Wtime calls and averaging the measurements across the 10 runs. This was accomplished during the same runs that were used to measure the overall execution time. Out of these four timings, three come from tasks that are the host-side activities required to handle CASIM's execution on the GPU. These are the setup time, the waiting time, and the post-wait time.
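
To illustrate where such timers sit relative to an asynchronous OpenACC launch, a simplified sketch is given below. It is an assumption-laden illustration: the array names, the trivial kernel body and the CPU-side work are placeholders, not the MONC/CASIM code.

    ! Sketch of separating Setup, Overlap, Wait and Post-Wait with MPI_Wtime
    ! around an asynchronous OpenACC launch (placeholder names and work).
    subroutine timed_offload(ncol, nz, work, cpu_side, results)
      use mpi
      implicit none
      integer, intent(in) :: ncol, nz
      real, intent(inout) :: work(nz, ncol), cpu_side(:), results(nz, ncol)
      double precision :: t0, t1, t2, t3, t4
      integer :: ic, k

      t0 = MPI_Wtime()
      !$acc enter data copyin(work) async(1)      ! transfers overlap with the CPU
      !$acc parallel loop async(1) present(work)
      do ic = 1, ncol
        do k = 1, nz
          work(k, ic) = work(k, ic) * 0.5         ! stand-in for CASIM column work
        end do
      end do
      t1 = MPI_Wtime()                            ! Setup time     = t1 - t0

      cpu_side = cpu_side + 1.0                   ! Overlap: CPU-only MONC work
      t2 = MPI_Wtime()                            ! Overlap time   = t2 - t1

      !$acc exit data copyout(work) async(1)
      !$acc wait(1)                               ! block until the GPU has finished
      t3 = MPI_Wtime()                            ! Waiting time   = t3 - t2

      results = work                              ! Post-Wait: copy to MONC buffers
      t4 = MPI_Wtime()                            ! Post-Wait time = t4 - t3
    end subroutine timed_offload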

As in the previous section, varying the z-size was found not to influence the conclusions drawn from the analysis; therefore, the z-size will be set to a fixed value of 80. The column count parameter x*y, on the other hand, will be varied.


Figure 5.4 shows the plot of the running time versus the number of grid columns for the accelerated and the CPU-only versions of CASIM+MONC. The figure has two main features:

• the accelerated CASIM+MONC is substantially faster than the CPU-only version up to the column count of 14336

• the running time of the accelerated CASIM+MONC rises sharply at the column count of 14336

The sharp increase in the accelerated CASIM+MONC's execution time that occurs at the column count of 14336 is caused by the similar increase in CASIM's execution time on the GPU which was described in Section 5.2.1. Out of the three components of the host-side CASIM-related time – setup, waiting, post-wait – only the waiting time is influenced by the kernel execution time on the GPU. Therefore, this component must show the same sharp increase in value at the 14336-column boundary. Figure 5.5 provides the evidence of this: the waiting time is the only component that rises in value at the expected point.

The analysis presented in the previous section revealed that CASIM did not receive a significant speed-up from the acceleration. However, Figure 5.4 has shown that the CASIM+MONC hybrid did become faster. This improvement in performance is caused by the fact that CASIM is now offloaded onto the accelerator while the CPU is free to perform other calculations in the meantime. Referring to Figure 5.3, the CASIM-related tasks on the CPU are: preparing CASIM to be launched on the accelerator, waiting for it to finish, and post-processing its outputs. Thus, the overall speed-up of CASIM+MONC stems from the fact that the sum of the execution times of these three tasks is less than the time taken by CASIM in the CPU-only version of CASIM+MONC, as illustrated by the plots in Figure 5.6. This shows that the speed-up of CASIM+MONC was achieved because of the reduction of CASIM-related time on the CPU despite the lack of speed-up of CASIM itself.
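
Stated compactly (the notation is introduced here for convenience only), the condition observed in Figure 5.6 is

    T_{\text{setup}} + T_{\text{wait}} + T_{\text{post-wait}} \;<\; T_{\text{CASIM}}^{\text{CPU-only}}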

As was discussed in the previous section, the configurations of MONC used for scientific runs are expected to have fewer than 14336 columns per process. Hence, the accelerated CASIM+MONC is expected to operate in the mode which exhibits the speed-up.

5.2.3 Co-execution of multiple kernels on a single GPU

A node of Piz Daint contains a GPU and a multi-core CPU. Since it is possible to launch several processes on the multi-core CPU which will then share the same GPU, it is important to investigate how many such processes can share the accelerator without experiencing performance degradation. The analysis presented in this section will answer this question.

The feature of NVIDIA GPUs that allows multiple processes to concurrently execute their kernels on a single GPU is called Multi-Process Service (MPS) [26]. MPS works transparently to the application and does not require any code changes.


[Plot: MONC runtime in seconds versus the number of columns (x*y); series: Baseline MONC walltime (Piz Daint), Accelerated MONC walltime.]

Figure 5.4: MONC running time versus the number of grid columns.

[Plot: time in seconds versus the number of columns (x*y); series: Setup time, Wait time, Post-wait time.]

Figure 5.5: Time spent in each of the three host-side CASIM-related tasks versus the column count. Sharp increase in the waiting time indicates an increase in the kernel's running time on the GPU.


[Plot: time in seconds versus the number of columns (x*y); series: Accelerated version CASIM-related CPU time, Baseline CASIM time.]

Figure 5.6: CASIM-related time versus the number of grid columns for the accelerated and the CPU-only versions. For the accelerated version, the time reported is the time the CPU spends on CASIM-related host-side activities; it does not include the execution time of CASIM on the accelerator.

Number of MPI Ranks   Number of Nodes   Ranks per Node
8                     8                 1
8                     4                 2
8                     2                 4
8                     1                 8

Table 5.1: Launch configurations used for investigating the number of kernels that can share the same GPU without performance penalty.

On a Cray system such as Piz Daint, MPS can be enabled by exporting the environment variable CRAY_CUDA_MPS=1. MPS supports sharing of an accelerator by multiple MPI processes.

The performance metric that is analysed in this section is CASIM+MONC's execution time as measured with MPI_Wtime calls placed at the start and at the end of the program. Sharing of an accelerator by multiple processes has been achieved by launching more than one MPI rank per node. CASIM+MONC was run with 8 MPI ranks, each of which launches a single GPU kernel during each timestep. The 8 MPI ranks were distributed across a varying number of nodes. The combinations of node counts and ranks per node are presented in Table 5.1. Ranks executing on the same node share the same GPU. Keeping the rank count the same while varying the number of ranks per node allowed changing the number of concurrent kernels per GPU while keeping the inter-rank communication overhead approximately the same. The MONC grid size has been fixed to 80 ∗ 80 ∗ 80 for all launch configurations. Each configuration has been run 10 times and CASIM+MONC's execution time averaged across these runs. The timing measurements for this experiment have been found to be stable across the same launch configurations. Inspection with the NVIDIA Visual Profiler confirmed that the kernels from different ranks are launched at the same time and are executed concurrently by the GPUs.


[Plot: MONC runtime in seconds versus the number of kernels per GPU (1, 2, 4, 8).]

Figure 5.7: MONC execution time versus the number of CASIM kernels running on one GPU for the grid of 6400 columns divided over 8 processes. Two kernels can be executed on the GPU concurrently without significant performance penalty.


Figure 5.7 presents the results of these runs. As can be seen, two kernels can share the same GPU without causing a significant loss of performance. When more than two kernels execute on the same GPU, the performance decreases. This is caused by the fact that for this grid size up to two kernels can run without sharing the GPU's streaming multiprocessors. The grid size of 80 ∗ 80 ∗ 80 yields 80 ∗ 80 = 6400 grid columns. The domain is divided across 8 processes, hence each process works on 6400/8 = 800 columns, which map to 800 GPU threads. CASIM uses thread blocks of size 128, therefore each kernel spawns ceiling(800/128) = 7 thread blocks. The GPU used in this project, the Tesla K20X, contains 14 Streaming Multiprocessors (SMs)¹. The GPU distributes thread blocks evenly across SMs to achieve load balancing. Hence, up to two kernels, each launching 7 thread blocks, can concurrently execute on the Tesla K20X without sharing SMs, each occupying 7 distinct SMs out of 14. Therefore, if more than two kernels are launched, some or all of the SMs will be used by more than one kernel. Two kernels executing on the same SM will experience a decrease in performance because they are sharing the SM's resources. These include, for example, the instruction schedulers, of which there are only 4 per SM on a Tesla K20X [15].

Thus, the number of CASIM kernels that can co-execute on a single GPU without incurring a performance penalty depends on the grid size, which must be chosen so that kernels do not share the GPU's streaming multiprocessors.
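
For the configuration used here, this condition can be written compactly as (notation introduced for convenience only: K concurrent kernels, C columns per MPI rank, thread blocks of 128 threads, and the 14 SMs of the K20X):

    K \cdot \left\lceil C / 128 \right\rceil \;\le\; 14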

¹ As confirmed by the deviceQuery utility [17].


5.2.4 Summary of the performance evaluation

Thus, the performance analysis revealed that the OpenACC port did not significantly enhance the performance of CASIM itself but benefited the CASIM+MONC hybrid by virtue of taking the burden of executing CASIM off the CPU and allowing it to perform other calculations concurrently. Sharing of a single GPU by multiple CASIM kernels without significant performance degradation was found to be possible provided that each streaming multiprocessor of the GPU is used by a single kernel only.

5.3 Maturity of OpenACC

The previous sections were focused on porting CASIM to OpenACC and evaluating the performance implications. This section addresses the second project goal by presenting the insights into OpenACC's maturity collected in the process of porting CASIM. This includes a discussion of the programmer productivity it enables and of the comprehensiveness of the feature set it provides.

CASIM is a complex code involving a lot of procedure calls and complex memory usage patterns. This stressed the capabilities of OpenACC more deeply than a traditional hotspot-offload use case does and allowed a more comprehensive coverage of its features.

The discussions of the OpenACC specification itself and of its particular implementation used in this project will be presented separately.

5.3.1 Maturity of the OpenACC specification

The OpenACC specification version 2.0 has been found to provide enough tools to port a complex code such as CASIM. This is illustrated by the description of the work conducted in this project (Section 3.2). Nevertheless, a suggestion for extending the specification will be made.

Three limitations have been found in the Cray implementation of OpenACC, as was described in Section 3.3. These features are not mentioned in the OpenACC specification; therefore, a valid implementation of the standard is not required to support them. However, supporting at least one of the three features would have significantly decreased the development time needed to port CASIM to OpenACC. This feature is the support for allocatable variables in the OpenACC private clauses, which was described in Section 3.3.1. Implementing it would also benefit the porting of other complex codes because allocatable variables are a widely used feature of Fortran. Inclusion of this support into the specification is required in order to deliver this feature to the compilers. The latest OpenACC specification, version 2.5 [20], has been found not to address this issue.


Thus, the OpenACC specification 2.0 has been found to be developed enough to enable the porting of a complex code, but expanding the specification as suggested would simplify the process. However, a deep understanding of the way OpenACC maps to a particular hardware architecture is needed in order to get the maximum performance improvement.

5.3.2 Maturity of the Cray implementation of OpenACC

The implementation of the OpenACC specification used in this project is the Cray Compilation Environment version 8.3.12, which implements the OpenACC specification version 2.0. No implementation of OpenACC 2.5 was available on the main project machine, Piz Daint, during the course of the project.

All OpenACC features needed by this project were found to be implemented by the compiler. However, several bugs were encountered when using these features.

Three distinct types of compiler bugs were encountered in this work, as was described in Section 3.4. Workarounds were found for all of them. However, additional time had to be allocated for this activity, which decreased programmer productivity. Wider application of OpenACC to complex codes such as CASIM would uncover more similar bugs over time and make the compiler more mature.

Thus, the Cray implementation of OpenACC has been found to be mature enough to allow the porting of a complex code, but only if the compiler bugs encountered in the process can be resolved. Inspection of the PTX assembly was required to work around all bugs encountered during this project, and this is expected to hold for other bugs as well. The number of such bugs is expected to decline as the OpenACC programming model becomes more widely used.


Chapter 6

Conclusion

This chapter relates the project's results to its goals and outlines the possibilities for future work.

The first goal of the project was the improvement of CASIM's performance. The OpenACC port has not brought performance benefits to CASIM itself but has enhanced the performance of the CASIM+MONC hybrid as a whole by taking CASIM off the CPU and allowing the CPU to perform other work concurrently. This suggests that offloading part of the calculation to an accelerator can increase the performance of an application as a whole even if the offloaded code cannot utilise the device most efficiently, provided that the execution of the accelerator and the CPU can overlap.

The second goal of the project was to provide new insights into the ability of OpenACC to provide enough features for porting a complex code such as CASIM. The OpenACC specification was found to be mature enough to enable the porting, but a suggestion was made to extend the private clause to explicitly support Fortran allocatable variables, a feature that would have reduced CASIM's porting time considerably and is expected to benefit other complex programs. The Cray implementation of OpenACC provided enough features to port CASIM but exhibited three compiler bugs which were not found documented in public sources. Thus, this project contributes to the development of OpenACC by discovering implementation bugs and limitations of the standard.

6.1 Future Work

Possibilities for further tuning of CASIM for the GPU hardware can be investigated. The current bottleneck is the warp stall¹ for execution dependency².

¹ Situation when a warp must block while waiting for an instruction to finish or for some hardware resource to become available.

² Waiting for the previously issued instruction to complete.


This issue calls for an increase in instruction-level parallelism; however, it is not clear how this can be achieved with CASIM. The next performance bottleneck on the GPU is the warp stall for memory throttle. This issue may be solved by coalescing memory accesses in CASIM. However, the stall-time share of this issue is small while the extent of the induced refactoring is expected to be high.

Only the warm microphysics processes were covered in the current OpenACC port of CASIM because these processes are of the most scientific interest for the Met Office. The code for the cold microphysics processes may be ported to OpenACC as well, thus covering the entire microphysics logic.

The code may be further optimised for the OpenACC accelerator model. The OpenACC cache directive may be introduced in an attempt to optimise the memory access pattern of CASIM. On GPUs, this directive is expected to trigger the use of shared memory, which is expected to improve the kernel's performance.
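
A minimal sketch of what such a change might look like is given below; the subroutine, array names and the trivial loop body are illustrative assumptions, not CASIM's actual code, and a useful placement of the directive inside CASIM would have to be determined by profiling.

    ! Illustrative sketch only: the cache directive asks the compiler to keep
    ! the named sub-array in fast (shared) memory for the enclosing loop body.
    subroutine smooth_columns(ncol, nz, column)
      implicit none
      integer, intent(in) :: ncol, nz
      real, intent(inout) :: column(nz, ncol)
      integer :: ic, k

      !$acc parallel loop copy(column)
      do ic = 1, ncol
        !$acc cache(column(1:nz, ic))
        !$acc loop seq
        do k = 2, nz
          ! simple vertical smoothing as a stand-in for CASIM's column work
          column(k, ic) = 0.5 * (column(k, ic) + column(k-1, ic))
        end do
      end do
    end subroutine smooth_columns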


Appendix A

List of the source code files changed by this project

The project was working on the existing code base of CASIM and MONC. Submitted are the full versions of the codes as downloaded from the SVN repository. Both the original and the modified versions of both programs are submitted.

This appendix lists the files that were changed or added during this project.

A.1 Files modified in CASIM

Changed files in the code_accelerated/casim/r957_accelerated directory:

makefile
src/accretion.F90
src/activation.F90
src/adjust_deposition.F90
src/aerosol_routines.F90
src/aggregation.F90
src/autoconversion.F90
src/breakup.F90
src/condensation.F90
src/distributions.F90
src/evaporation.F90
src/graupel_embryo.F90
src/graupel_wetgrowth.F90
src/homogeneous_freezing.F90
src/ice_accretion.F90
src/ice_deposition.F90
src/ice_melting.F90


src/ice_multiplication.F90
src/ice_nucleation.F90
src/initialize.F90
src/lookup.F90
src/m3_incs.F90
src/micro_main.F90
src/mphys_constants.F90
src/mphys_parameters.F90
src/mphys_switches.F90
src/mphys_tidy.F90
src/passive_fields.F90
src/preconditioning.F90
src/process_routines.F90
src/qsat_casim_func.F90
src/sedimentation.F90
src/snow_autoconversion.F90
src/special.F90
src/sum_procs.F90
src/sweepout_rate.F90
src/thresholds.F90
src/type_process.F90
src/ventfac.F90
src/which_mode_to_use.F90

A.2 Files modified in MONC

Added files in the code_accelerated/monc/r1071_casim_acc directory:

components/casim_join/makefile
components/casim_join/src/casim_join.F90
components/casim_join/src/casim_join_stub.F90

Changed files in the code_accelerated/monc/r1071_casim_acc directory:

components/casim/src/casim.F90
model_core/src/components/registry.F90
model_core/src/monc.F90


Bibliography

[1] Jonathan Wilkinson, UK Met Office. Timing Tests for the CASIM Microphysics Scheme. Technical Paper. 2015.

[2] Alexandr Nigay. MSc in High Performance Computing, Project Preparation course report. Course report. The University of Edinburgh, 2016.

[3] Matthew Norman et al. "A case study of CUDA FORTRAN and OpenACC for an atmospheric climate kernel". In: Journal of Computational Science 9 (2015). Computational Science at the Gates of Nature, pp. 1–6. ISSN: 1877-7503. DOI: 10.1016/j.jocs.2015.04.022. URL: http://www.sciencedirect.com/science/article/pii/S1877750315000605.

[4] Angus Lepper. "Accelerator weather forecasting". MSc in High Performance Computing. EPCC, University of Edinburgh, 2015. URL: https://static.ph.ed.ac.uk/dissertations/hpc-msc/2014-2015/Accelerator%20weather%20forecasting.pdf.

[5] Mark Govett, Jacques Middlecoff, and Tom Henderson. "Directive-based Parallelization of the NIM Weather Model for GPUs". In: Proceedings of the First Workshop on Accelerator Programming Using Directives. WACCPD '14. New Orleans, Louisiana: IEEE Press, 2014, pp. 55–61. ISBN: 978-1-4799-7023-0. DOI: 10.1109/WACCPD.2014.9. URL: http://dx.doi.org/10.1109/WACCPD.2014.9.

[6] UK Met Office. Numerical Models in Meteorology. URL: http://www.metoffice.gov.uk/research/modelling-systems/numerical-models (visited on 07/26/2016).

[7] C.-H. Moeng and P.P. Sullivan. "Large-Eddy Simulation". In: Encyclopedia of Atmospheric Sciences. Ed. by Gerald R. North, John Pyle, and Fuqing Zhang. Second Edition. Oxford: Academic Press, 2015, pp. 232–240. ISBN: 978-0-12-382225-3. DOI: 10.1016/B978-0-12-382225-3.00201-2. URL: http://www.sciencedirect.com/science/article/pii/B9780123822253002012.

[8] UK Met Office. Met Office Unified Model. URL: http://www.metoffice.gov.uk/research/modelling-systems/unified-model (visited on 07/26/2016).

[9] Hugh Morrison. An overview of cloud and precipitation microphysics and its parameterization in models. 2010. URL: http://www2.mmm.ucar.edu/wrf/users/workshops/WS2010/presentations/Lectures/morrison_wrf_workshop_2010_v2.pdf.

[10] Marat Khairoutdinov and Yefim Kogan. "A New Cloud Physics Parameterization in a Large-Eddy Simulation Model of Marine Stratocumulus". In: Monthly Weather Review 128.1 (2000), pp. 229–243. DOI: 10.1175/1520-0493(2000)128<0229:ANCPPI>2.0.CO;2. URL: http://dx.doi.org/10.1175/1520-0493(2000)128<0229:ANCPPI>2.0.CO;2.

[11] Nicholas Brown et al. "A highly scalable Met Office NERC Cloud model". In: Proceedings of the 3rd International Conference on Exascale Applications and Software. University of Edinburgh, July 2015. ISBN: 978-0-9926615-1-9.

[12] B. J. Shipway and A. A. Hill. "Diagnosis of systematic differences between multiple parametrizations of warm rain microphysics using a kinematic framework". In: Quarterly Journal of the Royal Meteorological Society 138.669 (2012), pp. 2196–2211. ISSN: 1477-870X. DOI: 10.1002/qj.1913. URL: http://dx.doi.org/10.1002/qj.1913.

[13] David C. Leon et al. "The Convective Precipitation Experiment (COPE): Investigating the Origins of Heavy Precipitation in the Southwestern United Kingdom". In: Bulletin of the American Meteorological Society 97.6 (2016), pp. 1003–1020. DOI: 10.1175/BAMS-D-14-00157.1. URL: http://dx.doi.org/10.1175/BAMS-D-14-00157.1.

[14] Swiss National Supercomputing Centre. Piz Daint. URL: http://www.cscs.ch/computers/piz_daint_piz_dora/index.html (visited on 08/16/2016).

[15] NVIDIA Corporation. NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110. Whitepaper. 2012. URL: https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf.

[16] NVIDIA Corporation. NVIDIA Tesla K-Series Datasheet. Datasheet. 2013. URL: http://www.nvidia.com/content/tesla/pdf/nvidia-tesla-kepler-family-datasheet.pdf.

[17] NVIDIA Corporation. CUDA Samples, Utilities, deviceQuery. URL: http://docs.nvidia.com/cuda/cuda-samples/#device-query (visited on 08/12/2016).

[18] OpenACC-standard.org. The OpenACC Application Programming Interface Version 2.0. Corrected, August 2013. URL: http://www.openacc.org/sites/default/files/OpenACC.2.0a_1.pdf (visited on 08/01/2016).

[19] Oliver Fuhrer et al. "Towards a Performance Portable, Architecture Agnostic Implementation Strategy for Weather and Climate Models". In: Supercomput. Front. Innov.: Int. J. 1.1 (Apr. 2014), pp. 45–62. ISSN: 2409-6008. DOI: 10.14529/jsfi140103. URL: http://dx.doi.org/10.14529/jsfi140103.


[20] OpenACC-standard.org. The OpenACC Application Programming Interface Version 2.5. October 2015. URL: http://www.openacc.org/sites/default/files/OpenACC_2pt5.pdf (visited on 08/01/2016).

[21] NVIDIA Corporation. PTX ISA :: CUDA Toolkit Documentation, Parallel Thread Execution ISA Version 4.3. URL: http://docs.nvidia.com/cuda/parallel-thread-execution/ (visited on 06/28/2016).

[22] NVIDIA Corporation. Profiler, CUDA Toolkit Documentation, Visual Profiler. URL: http://docs.nvidia.com/cuda/profiler-users-guide/index.html#visual-profiler.

[23] TOP500.org. The Top500 list for June 2016. URL: https://www.top500.org/lists/2016/06/ (visited on 08/17/2016).

[24] Jack Dongarra. Report on the Sunway TaihuLight System. Tech Report. University of Tennessee, June 24, 2016. URL: http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf.

[25] NVIDIA Corporation. Profiler :: CUDA Toolkit Documentation, nvprof overview. URL: http://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview (visited on 08/12/2016).

[26] NVIDIA Corporation. Multi-Process Service. Manual. 2015. URL: https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf.
