LFRic: Developing the next generation atmospheric model for the Met Office
Dr Christopher Maynard, Met Office, 27 September 2017
LFRic Acknowledgements
S. Adams, M. Ashworth, T. Benacchio, R. Ford, M. Glover, M. Guidolin, M. Hambley, M. Hobson, I. Kavcic, C. Maynard, T. Melvin, E. Mueller, S. Mullerworth, A. Porter, S. Pring, M. Rezny, G. Riley, S. Sandbach, R. Sharp, B. Shipway, K. Sivalingam, P. Slavin, J. Thuburn, R. Wong, N. Wood
Met Office HPC machines
Cray XC40 machines:
Phase 1a, August 2015: 2x 610 Intel Haswell nodes
Phase 1b, May 2016: 2x 2496 Intel Broadwell nodes
Phase 1c, Q1 2017: 1x 6720 Intel Broadwell nodes
5 machines, 460,672 compute cores (~½ million cores), input power 5.8 MW
96-node Intel Xeon Phi Knights Landing (KNL) Cray system (MO)
Isambard: GW4+MO Cray system with NVIDIA GPU (Pascal), KNL and 10k+ ARMv8 cores (eventually)
3 MW
Gung Ho – new Dycore
Cubed sphere: no singular poles (unlike lon-lat)
Unstructured mesh: can use other meshes
Mixed finite element scheme – C-grid
Exterior calculus: mimetic properties
Semi-implicit in time
Scientific aspects - GungHo
www.metoffice.gov.uk © Crown Copyright 2017, Met Office
Mixed finite element discretization of nonhydrostatic compressible equations
Flexibility in principle on type of grid and accuracy by using higher-order polynomials
Targeting cubed-sphere grid and lowest-order elements (similar to C-grid)
Compatibility carries over a number of desirable properties (e.g. mass conservation) from the continuous equations to their discretized form
Transport
COSMIC: direction-split scheme for mass, temperature and tracers In flux form, ensures local conservation
Time discretization: semi-implicit iterative scheme (as in ENDGame)
Started work on coupling existing UM physics parameterizations
Results
Cold air bubble on Cartesian domain
ENDGame
Baroclinic flow on the sphere
GungHo
Layered architecture - PSyKAl
Alg layer – high-level expression of operations on global fields
Kernel layer – low-level explicit operation on a single column of data
Code has to follow a set of rules (the PSyKAl API is a DSL)
Parallelisation System (PSy) layer – horizontal looping and parallel code
PSyclone
Python code generator: parser, transformations, generation
Controls parallel code (MPI/OpenMP and OpenACC)
Developed at STFC Hartree by R. Ford and A. Porter; works with the PSyKAl API
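The layered separation can be sketched in miniature. The following is a hypothetical Python sketch of the PSyKAl idea (LFRic itself is Fortran, and the PSy layer is generated by PSyclone rather than hand-written); all function names are invented for illustration.

```python
def scale_kernel(column, factor):
    """Kernel layer: explicit, serial operation on a single column of data."""
    return [factor * v for v in column]

def psy_layer(kernel, field, *args):
    """PSy layer: owns the horizontal loop; in LFRic this is where the
    generated MPI halo exchanges, colouring and OpenMP directives live."""
    return [kernel(column, *args) for column in field]

def algorithm(field):
    """Algorithm layer: high-level operation on a global field; it never
    sees columns, loops or parallelism."""
    return psy_layer(scale_kernel, field, 2.0)

field = [[1.0, 2.0], [3.0, 4.0]]   # 2 columns x 2 levels
print(algorithm(field))            # [[2.0, 4.0], [6.0, 8.0]]
```

The point of the split is that the middle function can be regenerated (MPI, OpenMP, OpenACC) without touching the science code above or below it.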
Strong scaling
Global volume fixed; time-step fixed for finest resolution (8 million cells horizontal)
Solid: pure MPI; hatched: 4 MPI tasks per node (2 per NUMA socket), 9 threads
6144 nodes – 221,118 cores
Scaled earth 1/10, 20 levels
Intel 15 has thread contention issues
ESMF only has 32-bit integers (6x6 for 6144 nodes)
Scaling Results
Based on 2017 CSAR tests
Baroclinic wave test
Naïve solver, short timestep
Lowest-order FEM, cubed sphere, full-size earth
30 levels, 30 km lid
Dynamics plus flux-based advection scheme
Advection not part of PSyKAl API: not autogenerated, no OpenMP
Mixed Mode – MPI + OpenMP
6 (BDW) XC40 nodes, wall-clock time
C96L30, FEM low order, baroclinic test, full-size Earth
Naïve solver preconditioner, short time-step (100)
C96 – 30 levels
Reduced wall clock
setup and pow() subtracted
MV and Jacobian scaling
Individual kernels show excellent OpenMP performance
Unstructured mesh and cell looping: colouring
Redundant computation: no special cases for comms
Halos need updates
Bigger MPI patch + threads, less halo
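To see why colouring enables threaded cell loops on an unstructured mesh, here is a minimal greedy colouring sketch (illustrative only; LFRic's colouring comes from its mesh infrastructure, and as noted later a better colouring algorithm is planned):

```python
def greedy_colour(neighbours):
    """Greedy graph colouring: cells sharing an entity get different
    colours, so all cells of one colour can be updated by OpenMP threads
    without write races on shared dofs."""
    colour = {}
    for cell in sorted(neighbours):
        used = {colour[n] for n in neighbours[cell] if n in colour}
        c = 0
        while c in used:   # smallest colour not used by a neighbour
            c += 1
        colour[cell] = c
    return colour

# Toy 2x2 patch of cells, 4-connected
nbrs = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
cols = greedy_colour(nbrs)
assert all(cols[a] != cols[b] for a in nbrs for b in nbrs[a])
```

The loop over cells of a single colour is then safe to workshare, at the cost of one parallel region (and barrier) per colour, which is why "too many colours" hurts, as seen on the GPU results later.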
Redundant Computation
[Diagram legend: owned cell / halo cell]
Dof living on a shared (partitioned) entity (edge) receives contributions from both owned and halo cells.
Redundantly compute the contribution in the halo to the shared dof.
Less comms, no special cases.
MPI only: 4 MPI ranks, all have halos.
Hybrid: 1 MPI task has a halo; 4 OpenMP threads share the halo (boundary-to-area scaling).
Less work for OpenMP threads.
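The boundary-to-area argument is simple arithmetic; a sketch with made-up patch sizes:

```python
def halo_cells(n, depth=1):
    """Cells in a depth-d halo around an n x n patch: (n+2d)^2 - n^2."""
    return (n + 2 * depth) ** 2 - n ** 2

# Hypothetical sizes: four pure-MPI ranks each own a 48x48 patch, versus
# one hybrid rank owning the combined 96x96 patch shared by 4 threads.
pure_mpi = 4 * halo_cells(48)   # every rank carries its own halo
hybrid = halo_cells(96)         # one halo around the bigger patch
assert hybrid < pure_mpi        # fewer halo cells to exchange and update
```

The halo grows with the perimeter while the owned work grows with the area, so merging ranks under threads shrinks the halo fraction.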
[Diagram: MPI only – ranks 0–3 each with a halo; hybrid – rank 0 with threads 0–3 sharing one halo]
Strong Scaling – Baroclinic test
[Plot: total run-time, 36 MPI per node vs 6 MPI / 6 OMP threads; local volumes LV=32², 24², 16², 12²]
OpenMP and MPI
Both show good scaling; mixed mode is slightly faster and scales better
Some parts of code have no OpenMP (advection scheme still being developed; not yet supported in PSyKAl API – future work)
MPI is faster than hybrid for these sections
Intel OpenMP runtime incurs a 20–40 s shutdown penalty! – Yuk!
Look at individual kernels and comms
Matrix-vector strong scaling
[Plot: 36 MPI per node vs 6 MPI / 6 OMP threads; local volumes LV=32², 24², 16², 12²]
Global Sum
OpenMP workshare across dofs for the local sum
Global sum of a scalar via ESMF (MPI_Allreduce)
Global Sum Reproducibility
OpenMP reduce does NOT reproduce the same answer with the SAME number of threads.
Accumulate each thread separately: the serial sum contribution for each thread (pad arrays to avoid false sharing).
Same number of threads, same answer.
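A sketch of the thread-reproducible reduction described above, with serial Python standing in for OpenMP threads (the contiguous chunking scheme is an assumption for illustration):

```python
def threaded_sum(data, nthreads):
    """Each 'thread' serially sums a fixed contiguous chunk; the partial
    sums are then combined in thread order. With the same nthreads the
    rounding pattern is identical, so the answer is bit-reproducible."""
    chunk = (len(data) + nthreads - 1) // nthreads
    partials = [sum(data[t * chunk:(t + 1) * chunk], 0.0)
                for t in range(nthreads)]
    return sum(partials, 0.0)       # deterministic, thread-ordered combine

data = [0.1 * i for i in range(10000)]
assert threaded_sum(data, 4) == threaded_sum(data, 4)   # reproducible
# A different thread count may round differently in the last bits.
```

In the Fortran version the per-thread partials would live in a padded array so that each thread's accumulator sits on its own cache line.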
The distributed sum is for a single scalar, so it is latency bound.
A double-double, decomposition-reproducible sum is more expensive.
Decomposition reproducibility for mixed mode is difficult AND expensive: we do not support this.
Global Sum
Sum local data, then call MPI_Allreduce on a scalar via ESMF.
Global comms scale with the number of MPI ranks (latency).
Cray Aries network: adaptive routing maximises use of network links – depends on network traffic.
More OMP threads means fewer MPI ranks.
Global sum is faster?
8 measurements
Local communication
Halo exchange is nearest neighbour; scales with data size (bandwidth).
More OMP threads, fewer MPI ranks: bigger halo, fewer messages, more data.
Fewer, bigger messages give good MPI performance.
No performance penalty for the OMP version.
Local volume per parallel thread (T1, T6).
In principle, overlap comms and compute.
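A toy latency/bandwidth cost model (all numbers invented) shows why fewer, bigger messages win:

```python
def exchange_cost(n_msgs, bytes_per_msg, latency=1e-6, bandwidth=1e10):
    """Crude halo-exchange cost: per-message latency plus total bytes
    divided by bandwidth. Parameters are illustrative, not Aries specs."""
    return n_msgs * latency + n_msgs * bytes_per_msg / bandwidth

# Same total halo data (288 kB per step), packed differently:
many_small = exchange_cost(36, 8_000)   # pure MPI: many small messages
few_big = exchange_cost(6, 48_000)      # hybrid: fewer, bigger messages
assert few_big < many_small             # latency term drives the gap
```

With the bandwidth term fixed by the data volume, the saving comes almost entirely from paying fewer message latencies, which matches the "no performance penalty for the OMP version" observation.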
4 measurements; depth-1 halos
[Plot labels: local volumes 12², 16², 24², 32²]
Scaling - summary
Can run on 221,118 cores – the PSyKAl API works
Tested scaling to 13K/55K – scales well
Individual kernels show excellent scaling
Redundant computation into the halo: OpenMP wins
Mostly still solver (naïve preconditioner): lots of global sums (see this in the profile)
Implementing Column Matrix Assembly and Multi-Grid this year (E. Mueller, Bath Uni)
Better solver formulation, longer time-step
Better colouring algorithm
Fused kernels for built-ins now supported in PSyclone
Computer Architectures and programming models
OpenACC
Offload data and kernel, same logic as OpenMP
Ideally have bigger data regions; need to annotate kernel source
SIMD (vector/warp) level parallelism: PSyclone cannot yet do this
The EuroExa project and LFRic
Horizon 2020 FETHPC-01-2016:
Co-design of HPC systems and applications
EuroExa started 1st Sep 2017, runs for 3½ years
16 partners, 8 countries, €20M
Builds on previous projects, esp. ExaNoDe, ExaNeSt, EcoScale
Aim: design, build, test and evaluate an Exascale prototype system
Architecture based on ARM CPUs with FPGA accelerators
Three testbed systems: testbed #3 will be ~1 Pflop/s
Low-power design goal to target realistic Exascale systems
Architecture evolves in response to app requirements = co-design
Wide range of apps, including LFRic, NEMO (STFC) and IFS (ECMWF)
https://insidehpc.com/2017/09/announcing-euroexa-project-exascale/
LFRic work at University of Manchester (MA & GR)
• Kernels, e.g. matrix-vector; LFRic mini-apps; LFRic full code
• Use Xilinx Vivado HLS and/or Maxeler compiler to generate code for the FPGAs from kernels
Kick-off meeting 4th-5th Sep 2017, Barcelona
PSyKE
PSyclone Kernel Extractor: captures and dumps LFRic runtime data
Run kernels in isolation.
Sergi Siso (IPCC, Hartree Centre): code optimisations for LMA matrix-vector (KNL)
LMA / CMA: don't want invasive code changes ... yet
Use Isambard to compare different things.
KNL single node – MV LO
No MPI, shared memory only
Small problem: C12 – 864 cells, 40 levels
Non-optimal colouring: 4 + 2 small ones ... yuk
Code opt has good benefit for the Intel compiler (it doesn't vectorise without !$omp simd)
Cray compiler – even better
KNL Cray compiler MV LO
Cray compiler already doing a good job at vectorising; don't need code optimisation
It can also compile OpenACC
[Plot legend: crayftn (ftn -ra), with code opt / no code opt]
MV comparison – LO
PAS – Pascal NVIDIA GPU
Not including data transfer
Too many colours, not enough levels
GPU really likes alternative loop order!
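The "alternative loop order" point is about memory traversal, not results; a minimal sketch with invented sizes (in the real kernel this is the cell/level nesting, and the stride determines whether GPU threads get coalesced accesses):

```python
# Toy field stored column-major: x[cell][level]
ncell, nlev = 4, 3
x = [[c * nlev + k for k in range(nlev)] for c in range(ncell)]

# Order A: cell outer, level inner -- contiguous (stride-1) over levels.
a = [x[c][k] * 2 for c in range(ncell) for k in range(nlev)]
# Order B: level outer, cell inner -- strided across cells.
b = [x[c][k] * 2 for k in range(nlev) for c in range(ncell)]

assert sorted(a) == sorted(b)   # same work, same values...
assert a != b                   # ...visited in a different memory order
```

Which order is faster depends on the hardware: the GPU prefers adjacent threads touching adjacent cells, which is why swapping the nesting helps there but (as the next slide notes) not at higher order.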
MV LO comparison II
Comparing 3 architectures, two compilers and a code optimisation. Something for everyone in this plot!
MV HO comparison
Change of loop order doesn't work for higher order on any architecture!
Baroclinic test 6 KNL nodes
Full code – time for matrix-vector
Run BDW binary – no AVX512
Cache mode / Quadrant (flat mode is no faster: problem fits in cache)
Fully populated nodes
M = MPI ranks, T = OMP threads
More threads is better, even over-subscribed (2 and 4 threads per core).
Baroclinic test 6 KNL nodes
Recompile for KNL: a bit faster
OSC is worse (??)
Baroclinic test 6 KNL nodes
Code Optimisation
No effect on BDW, big effect on KNL Intel compiler
OSC is worse (??)
Volume dependence
Still matrix-vector; change problem size
Seems to scale linearly
Hatched is OSC
Blue is twice the number of levels: twice as much work per thread, more data in the vector
[Plot labels: local volumes 12², 16², 18²]
Conclusions
Gung Ho removes the algorithmic bottleneck
LFRic employs a domain-specific language approach
Developing the PSyKAl API and PSyclone code generator
Science code separate from parallelisation/optimisation
Change science code – kernels, algorithms, FEM order, mesh – parallelism is auto-generated
Change programming model/optimisation – no change to science code
Target different hardware – ARM, GPU, Xeon Phi – the optimisation space is large
Need to generate optimisations from a single science source
LFRic IO Work package 2016/2017XIOS-LFRic integration
• IPSL delivered parallel UGRID output (Sept 2016)
• Development work to output prognostic scalar and vector fields as UGRID (3D layered topology)
• Preliminary visualisation tools (regridding and basic plots using Met Office Iris)
• Refactoring of current LFRic IO and addition of XIOS option (currently in work and nearly complete)
• Preliminary performance evaluation
• On Met Office XCS machine (Jan–Feb 2016)
• Investigated:
• Scalability
• Interaction with Lustre file system
• Diagnostic 'loading'
Performance highlights
XIOS appears to handle massively parallel runs well
Tested out to 2304 nodes of our XCS machine (~80K processors)
Scaling (w.r.t. runtime) appears to be good
•Up to 384 nodes (13824 proc) runtimes within 10% of jobs without IO. I believe this is within natural variability of runtime on XCS.
Results for 288x288 mesh (approx. 35 km resolution)
XIOS 'client ratio' is % time the client spent waiting
•On very large runs (1152 and 2304 nodes) we do see some significant increase in runtime compared to non-IO jobs. However, no tuning has been done for these sizes, so further work needed to find optimal setup.
Performance highlights
Preliminary tests with Lustre striping show that, if we set it, we can get good performance with a low number of IO servers
With default LFS (unstriped), 96 nodes, 3456 processors, 36 XIOS servers:
• Client ratio 0.377%, write gaps 234
With stripe count = 36, 96 nodes, 3456 processors, 36 XIOS servers:
• Client ratio 0.482%, write gaps 95
With stripe count set, the best result was with 12 XIOS servers:
• Client ratio 0.292%, write gaps 67
Diagnostic loading tests show no issue with output of large numbers of fields to one UGRID file (245 fields, 273 GB)
• Full results available on LFRic SRS wiki: https://code.metoffice.gov.uk/trac/lfric/wiki/LFRic/XIOSPerf
Algorithm Layer
invoke() – do this in parallel
Kernels: single-column operations; fields: data-parallel global fields
Multiple kernels in a single invoke: the scope of ordering, parallelism, communication, etc.
Kernel Layer
metadata
code
do k=0, nlayers -1
PSyclone transformations
Generated PSy layer: update halos (ESMF/MPI), colouring from the infrastructure
OpenMP workshare across cells in a colour
Kernel call for a single column; args are arrays and scalars