LFRic: Developing the next generation atmospheric model for the Met Office
Dr Christopher Maynard, Met Office, 27 September 2017
LFRic Acknowledgements
S. Adams, M. Ashworth, T. Benacchio, R. Ford, M. Glover, M. Guidolin, M. Hambley, M. Hobson, I. Kavcic, C. Maynard, T. Melvin, E. Mueller, S. Mullerworth, A. Porter, S. Pring, M. Rezny, G. Riley, S. Sandbach, R. Sharp, B. Shipway, K. Sivalingam, P. Slavin, J. Thuburn, R. Wong, N. Wood
Met Office HPC machines
Cray XC40 machines:
Phase 1a, August 2015: 2x 610 Intel Haswell nodes
Phase 1b, May 2016: 2x 2496 Intel Broadwell nodes
Phase 1c, Q1 2017: 1x 6720 Intel Broadwell nodes
5 machines, 460,672 compute cores (~½ million cores), input power 5.8 MW
96-node Intel Xeon Phi Knights Landing (KNL) Cray system (MO)
Isambard: GW4+MO Cray system with NVIDIA GPU (Pascal), KNL and 10k+ ARMv8 cores (eventually)
3 MW
Gung Ho – new Dycore
Cubed sphere: no singular poles (unlike lon-lat)
Unstructured mesh: can use other meshes
Mixed finite element scheme – C-grid
Exterior calculus: mimetic properties
Semi-implicit in time
Scientific aspects - GungHo
www.metoffice.gov.uk © Crown Copyright 2017, Met Office
Mixed finite element discretization of nonhydrostatic compressible equations
Flexibility in principle on type of grid and accuracy by using higher-order polynomials
Targeting cubed-sphere grid and lowest-order elements (similar to C-grid)
Compatibility carries over a number of desirable properties (e.g. mass conservation) from the continuous equations to their discretized form
Transport
COSMIC: direction-split scheme for mass, temperature and tracers In flux form, ensures local conservation
Time discretization: semi-implicit iterative scheme (as in ENDGame)
Started work on coupling existing UM physics parameterizations
Results
Cold air bubble on Cartesian domain
ENDGame
Baroclinic flow on the sphere
GungHo
Layered architecture - PSyKAl
Alg layer – high-level expression of operations on global fields
Kernel layer – low-level explicit operation on a single column of data
Code has to follow a set of rules (the PSyKAl API is a DSL)
Parallelisation System (PSy) layer – horizontal looping and parallel code
PSyclone
Python code generator: parser, transformations, generation
Controls parallel code (MPI/OpenMP and OpenACC)
Developed at STFC Hartree by R. Ford and A. Porter; works with the PSyKAl API
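The layered separation can be sketched in miniature. The following is a hypothetical Python sketch of the PSyKAl idea (LFRic itself is Fortran, and the PSy layer is generated by PSyclone rather than hand-written); all function names are invented for illustration.

```python
def scale_kernel(column, factor):
    """Kernel layer: explicit, serial operation on a single column of data."""
    return [factor * v for v in column]

def psy_layer(kernel, field, *args):
    """PSy layer: owns the horizontal loop; in LFRic this is where the
    generated MPI halo exchanges, colouring and OpenMP directives live."""
    return [kernel(column, *args) for column in field]

def algorithm(field):
    """Algorithm layer: high-level operation on a global field; it never
    sees columns, loops or parallelism."""
    return psy_layer(scale_kernel, field, 2.0)

field = [[1.0, 2.0], [3.0, 4.0]]   # 2 columns x 2 levels
print(algorithm(field))            # [[2.0, 4.0], [6.0, 8.0]]
```

The point of the split is that the middle function can be regenerated (MPI, OpenMP, OpenACC) without touching the science code above or below it.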
Strong scaling
Global volume fixed; time-step fixed for finest resolution (8 million cells horizontal)
Solid: pure MPI; hatched: 4 MPI tasks per node (2 per NUMA socket), 9 threads
6144 nodes – 221,118 cores
Scaled earth 1/10, 20 levels
Intel 15 has thread contention issues
ESMF only has 32-bit integers (6x6 for 6144 nodes)
Scaling Results
Based on 2017 CSAR tests
Baroclinic wave test
Naïve solver, short timestep
Lowest-order FEM, cubed sphere, full-size earth
30 levels, 30 km lid
Dynamics plus flux-based advection scheme
Advection not part of PSyKAl API: not autogenerated, no OpenMP
Mixed Mode – MPI + OpenMP
6 (BDW) XC40 nodes, wall-clock time
C96L30, FEM low order, baroclinic test, full-size Earth
Naïve solver preconditioner, short time-step (100)
C96 – 30 levels
Reduced wall clock
setup and pow() subtracted
MV and Jacobian scaling
Individual kernels show excellent OpenMP performance
Unstructured mesh and cell looping: colouring
Redundant computation: no special cases for comms
Halos need updates
Bigger MPI patch + threads, less halo
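To see why colouring enables threaded cell loops on an unstructured mesh, here is a minimal greedy colouring sketch (illustrative only; LFRic's colouring comes from its mesh infrastructure, and as noted later a better colouring algorithm is planned):

```python
def greedy_colour(neighbours):
    """Greedy graph colouring: cells sharing an entity get different
    colours, so all cells of one colour can be updated by OpenMP threads
    without write races on shared dofs."""
    colour = {}
    for cell in sorted(neighbours):
        used = {colour[n] for n in neighbours[cell] if n in colour}
        c = 0
        while c in used:   # smallest colour not used by a neighbour
            c += 1
        colour[cell] = c
    return colour

# Toy 2x2 patch of cells, 4-connected
nbrs = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
cols = greedy_colour(nbrs)
assert all(cols[a] != cols[b] for a in nbrs for b in nbrs[a])
```

The loop over cells of a single colour is then safe to workshare, at the cost of one parallel region (and barrier) per colour, which is why "too many colours" hurts, as seen on the GPU results later.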
Redundant Computation
[Diagram legend: owned cell / halo cell]
Dof living on a shared (partitioned) entity (edge) receives contributions from both owned and halo cells.
Redundantly compute the contribution in the halo to the shared dof.
Less comms, no special cases.
MPI only: 4 MPI ranks, all have halos.
Hybrid: 1 MPI task has a halo; 4 OpenMP threads share the halo (boundary-to-area scaling).
Less work for OpenMP threads.
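The boundary-to-area argument is simple arithmetic; a sketch with made-up patch sizes:

```python
def halo_cells(n, depth=1):
    """Cells in a depth-d halo around an n x n patch: (n+2d)^2 - n^2."""
    return (n + 2 * depth) ** 2 - n ** 2

# Hypothetical sizes: four pure-MPI ranks each own a 48x48 patch, versus
# one hybrid rank owning the combined 96x96 patch shared by 4 threads.
pure_mpi = 4 * halo_cells(48)   # every rank carries its own halo
hybrid = halo_cells(96)         # one halo around the bigger patch
assert hybrid < pure_mpi        # fewer halo cells to exchange and update
```

The halo grows with the perimeter while the owned work grows with the area, so merging ranks under threads shrinks the halo fraction.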
[Diagram: MPI only – ranks 0–3 each with a halo; hybrid – rank 0 with threads 0–3 sharing one halo]
Strong Scaling – Baroclinic test
[Plot: total run-time, 36 MPI per node vs 6 MPI / 6 OMP threads; local volumes LV=32², 24², 16², 12²]
OpenMP and MPI
Both show good scaling; mixed mode is slightly faster and scales better
Some parts of code have no OpenMP (advection scheme still being developed; not yet supported in PSyKAl API – future work)
MPI is faster than hybrid for these sections
Intel OpenMP runtime incurs a 20–40 s shutdown penalty! – Yuk!
Look at individual kernels and comms
Matrix-vector strong scaling
[Plot: 36 MPI per node vs 6 MPI / 6 OMP threads; local volumes LV=32², 24², 16², 12²]
Global Sum
OpenMP workshare across dofs for the local sum
Global sum of a scalar via ESMF (MPI_Allreduce)
Global Sum Reproducibility
OpenMP reduce does NOT reproduce the same answer with the SAME number of threads.
Accumulate each thread separately: the serial sum contribution for each thread (pad arrays to avoid false sharing).
Same number of threads, same answer.
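A sketch of the thread-reproducible reduction described above, with serial Python standing in for OpenMP threads (the contiguous chunking scheme is an assumption for illustration):

```python
def threaded_sum(data, nthreads):
    """Each 'thread' serially sums a fixed contiguous chunk; the partial
    sums are then combined in thread order. With the same nthreads the
    rounding pattern is identical, so the answer is bit-reproducible."""
    chunk = (len(data) + nthreads - 1) // nthreads
    partials = [sum(data[t * chunk:(t + 1) * chunk], 0.0)
                for t in range(nthreads)]
    return sum(partials, 0.0)       # deterministic, thread-ordered combine

data = [0.1 * i for i in range(10000)]
assert threaded_sum(data, 4) == threaded_sum(data, 4)   # reproducible
# A different thread count may round differently in the last bits.
```

In the Fortran version the per-thread partials would live in a padded array so that each thread's accumulator sits on its own cache line.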
The distributed sum is for a single scalar, so it is latency bound.
A double-double, decomposition-reproducible sum is more expensive.
Decomposition reproducibility for mixed mode is difficult AND expensive: we do not support this.
Global Sum
Sum local data, then call MPI_Allreduce on a scalar via ESMF.
Global comms scale with the number of MPI ranks (latency).
Cray Aries network: adaptive routing maximises use of network links – depends on network traffic.
More OMP threads means fewer MPI ranks.
Global sum is faster?
8 measurements
Local communication
Halo exchange is nearest neighbour; scales with data size (bandwidth).
More OMP threads, fewer MPI ranks: bigger halo, fewer messages, more data.
Fewer, bigger messages give good MPI performance.
No performance penalty for the OMP version.
Local volume per parallel thread (T1, T6).
In principle, overlap comms and compute.
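A toy latency/bandwidth cost model (all numbers invented) shows why fewer, bigger messages win:

```python
def exchange_cost(n_msgs, bytes_per_msg, latency=1e-6, bandwidth=1e10):
    """Crude halo-exchange cost: per-message latency plus total bytes
    divided by bandwidth. Parameters are illustrative, not Aries specs."""
    return n_msgs * latency + n_msgs * bytes_per_msg / bandwidth

# Same total halo data (288 kB per step), packed differently:
many_small = exchange_cost(36, 8_000)   # pure MPI: many small messages
few_big = exchange_cost(6, 48_000)      # hybrid: fewer, bigger messages
assert few_big < many_small             # latency term drives the gap
```

With the bandwidth term fixed by the data volume, the saving comes almost entirely from paying fewer message latencies, which matches the "no performance penalty for the OMP version" observation.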
4 measurements; depth-1 halos
[Plot labels: local volumes 12², 16², 24², 32²]
Scaling - summary
Can run on 221,118 cores – the PSyKAl API works
Tested scaling to 13K/55K – scales well
Individual kernels show excellent scaling
Redundant computation into the halo: OpenMP wins
Mostly still solver (naïve preconditioner): lots of global sums (see this in the profile)
Implementing Column Matrix Assembly and Multi-Grid this year (E. Mueller, Bath Uni)
Better solver formulation, longer time-step
Better colouring algorithm
Fused kernels for built-ins now supported in PSyclone
Computer Architectures and programming models
OpenACC
Offload data and kernel, same logic as OpenMP
Ideally have bigger data regions; need to annotate kernel source
SIMD (vector/warp) level parallelism: PSyclone cannot yet do this
The EuroExa project and LFRic
Horizon 2020 FETHPC-01-2016:
Co-design of HPC systems and applications
EuroExa started 1st Sep 2017, runs for 3½ years
16 partners, 8 countries, €20M
Builds on previous projects, esp. ExaNoDe, ExaNeSt, EcoScale
Aim: design, build, test and evaluate an Exascale prototype system
Architecture based on ARM CPUs with FPGA accelerators
Three testbed systems: testbed #3 will be ~1 Pflop/s
Low-power design goal to target realistic Exascale systems
Architecture evolves in response to app requirements = co-design
Wide range of apps, including LFRic, NEMO (STFC) and IFS (ECMWF)
https://insidehpc.com/2017/09/announcing-euroexa-project-exascale/
LFRic work at University of Manchester (MA & GR)
• Kernels, e.g. matrix-vector; LFRic mini-apps; LFRic full code
• Use Xilinx Vivado HLS and/or Maxeler compiler to generate code for the FPGAs from kernels
Kick-off meeting 4th-5th Sep 2017, Barcelona
PSyKE
PSyclone Kernel Extractor: captures and dumps LFRic runtime data
Run kernels in isolation.
Sergi Siso (IPCC, Hartree Centre): code optimisations for LMA matrix-vector (KNL)
LMA / CMA: don't want invasive code changes ... yet
Use Isambard to compare different things.
KNL single node – MV LO
No MPI, shared memory only
Small problem: C12 – 864 cells, 40 levels
Non-optimal colouring: 4 + 2 small ones ... yuk
Code opt has good benefit for the Intel compiler (it doesn't vectorise without !$omp simd)
Cray compiler – even better
KNL Cray compiler MV LO
Cray compiler already doing a good job at vectorising; don't need code optimisation
It can also compile OpenACC
[Plot legend: crayftn (ftn -ra), with code opt / no code opt]
MV comparison – LO
PAS – Pascal NVIDIA GPU
Not including data transfer
Too many colours, not enough levels
GPU really likes alternative loop order!
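The "alternative loop order" point is about memory traversal, not results; a minimal sketch with invented sizes (in the real kernel this is the cell/level nesting, and the stride determines whether GPU threads get coalesced accesses):

```python
# Toy field stored column-major: x[cell][level]
ncell, nlev = 4, 3
x = [[c * nlev + k for k in range(nlev)] for c in range(ncell)]

# Order A: cell outer, level inner -- contiguous (stride-1) over levels.
a = [x[c][k] * 2 for c in range(ncell) for k in range(nlev)]
# Order B: level outer, cell inner -- strided across cells.
b = [x[c][k] * 2 for k in range(nlev) for c in range(ncell)]

assert sorted(a) == sorted(b)   # same work, same values...
assert a != b                   # ...visited in a different memory order
```

Which order is faster depends on the hardware: the GPU prefers adjacent threads touching adjacent cells, which is why swapping the nesting helps there but (as the next slide notes) not at higher order.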
MV LO comparison II
Comparing 3 architectures, two compilers and a code optimisation. Something for everyone in this plot!
MV HO comparison
Change of loop order doesn't work for higher order on any architecture!
Baroclinic test 6 KNL nodes
Full code – time for matrix-vector
Run BDW binary – no AVX512
Cache mode / Quadrant (flat mode is no faster: problem fits in cache)
Fully populated nodes
M = MPI ranks, T = OMP threads
More threads is better, even over-subscribed (2 and 4 threads per core).
Baroclinic test 6 KNL nodes
Recompile for KNL: a bit faster
OSC is worse (??)
Baroclinic test 6 KNL nodes
Code Optimisation
No effect on BDW, big effect on KNL Intel compiler
OSC is worse (??)
Volume dependence
Still matrix-vector; change problem size
Seems to scale linearly
Hatched is OSC
Blue is twice the number of levels: twice as much work per thread, more data in the vector
[Plot labels: local volumes 12², 16², 18²]
Conclusions
Gung Ho removes the algorithmic bottleneck
LFRic employs a domain-specific language approach
Developing the PSyKAl API and PSyclone code generator
Science code separate from parallelisation/optimisation
Change science code – kernels, algorithms, FEM order, mesh – parallelism is auto-generated
Change programming model/optimisation – no change to science code
Target different hardware – ARM, GPU, Xeon Phi – the optimisation space is large
Need to generate optimisations from a single science source
LFRic IO Work package 2016/2017XIOS-LFRic integration
• IPSL delivered parallel UGRID output (Sept 2016)
• Development work to output prognostic scalar and vector fields as UGRID (3D layered topology)
• Preliminary visualisation tools (regridding and basic plots using Met Office Iris)
• Refactoring of current LFRic IO and addition of XIOS option (currently in work and nearly complete)
• Preliminary performance evaluation
• On Met Office XCS machine (Jan–Feb 2016)
• Investigated:
• Scalability
• Interaction with Lustre file system
• Diagnostic 'loading'
Performance highlights
XIOS appears to handle massively parallel runs well
Tested out to 2304 nodes of our XCS machine (~80K processors)
Scaling (w.r.t. runtime) appears to be good
•Up to 384 nodes (13824 proc) runtimes within 10% of jobs without IO. I believe this is within natural variability of runtime on XCS.
Results for 288x288 mesh (approx. 35 km resolution)
XIOS 'client ratio' is % time the client spent waiting
•On very large runs (1152 and 2304 nodes) we do see some significant increase in runtime compared to non-IO jobs. However, no tuning has been done for these sizes, so further work needed to find optimal setup.
Performance highlights
Preliminary tests with Lustre striping show that, if we set it, we can get good performance with a low number of IO servers
With default LFS (unstriped), 96 nodes, 3456 processors, 36 XIOS servers:
• Client ratio 0.377%, write gaps 234
With stripe count = 36, 96 nodes, 3456 processors, 36 XIOS servers:
• Client ratio 0.482%, write gaps 95
With stripe count set, the best result was with 12 XIOS servers:
• Client ratio 0.292%, write gaps 67
Diagnostic loading tests show no issue with output of large numbers of fields to one UGRID file (245 fields, 273 GB)
• Full results available on LFRic SRS wiki: https://code.metoffice.gov.uk/trac/lfric/wiki/LFRic/XIOSPerf
Algorithm Layer
invoke() – do this in parallel
Kernels: single-column operations; fields: data-parallel global fields
Multiple kernels in a single invoke: the scope of ordering, parallelism, communication, etc.
Kernel Layer
metadata
code
do k=0, nlayers -1
PSyclone transformations
Generated PSy layer: update halos (ESMF/MPI), colouring from the infrastructure
OpenMP workshare across cells in a colour
Kernel call for a single column; args are arrays and scalars