Planned use of 2012 HPC Resource - Weather and climate change


Transcript of Planned use of 2012 HPC Resource - Weather and climate change

© Crown copyright Met Office

Planned use of 2012 HPC Resource MOSAC 15.5 – Nov 2010

Presentation overview

• IBM Power 6
  • Resource Breakdown
  • Operational Constraints
  • Scalability Constraint
  • Optimisation Successes
• IBM Phase 2
  • What is it? When do we expect it?
  • Performance expectations
• Plans for using future capacity
  • Operational NWP Resource Need
  • HPC needs to deliver Climate strategy


Weather & Climate Sites in Top500 HPC List, Jun 2010

Rank  Site                                       Manufacturer        Country      Cores    Peak (TF)
1     Oak Ridge                                  Cray XT5 & XT4      USA          255138   2591
—     KMA                                        Cray XT6            Korea        —        ~600
16    University of Edinburgh                    Cray XT6m & XT4     UK           66316    575
39    ECMWF                                      IBM P6              UK           16640    313
14    Forschungszentrum Juelich (FZJ)            Bull/Sun (Intel)    Germany      26304    308
—     CPTEC - INPE                               Cray XT6            Brazil       —        ~240
61    Naval Oceanographic Office - NAVO MSRC     Cray XT5 & IBM P6   USA          18173    219
78    NCEP                                       IBM P6              USA          9984     188
41    DKRZ                                       IBM P6              Germany      8064     152
104   Met Office                                 IBM P6              UK           7040     132
37    Agency for Marine-Earth Science & Tech     NEC SX9             Japan        1280     131
90    NCAR                                       IBM P6              USA          4064     76
107   Meteorological Research Institute (MRI)    Hitachi             Japan        3872     73
94    Indian Institute of Tropical Meteorology   IBM P6              India        3744     70
113   Bureau of Meteorology / CSIRO              Sun (Intel)         Australia    4600     54
434   NIWA                                       IBM P6              New Zealand  1792     34
465   Environment Canada                         IBM P5              Canada       4096     31

Resource Allocation – Aug 2010

Modelling System                   Fraction of Total Resource
Climate (R&D & Production)         49.6%
NWP R&D                            37.3%
Non-Climate Production             13.1%  (NWP R&D : Non-Climate Production ≈ 3:1)
  UK                                3.6%
  Global                            3.7%
  Seasonal                          2.1%
  MOGREPS                           1.3%
  NAE                               1.3%
  Other LAMs, e.g. Defence          0.9%
  Wet Models                        0.2%
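The indented items are the breakdown of the Non-Climate Production share, and the "3:1 ratio" annotation appears to compare NWP R&D with non-climate production; a quick arithmetic check (a sketch using only the figures above) supports both readings:

```python
# Arithmetic check of the Aug 2010 allocation table (figures from the slide;
# the 3:1 reading of NWP R&D vs non-climate production is an inference).
non_climate_production = {
    "UK": 3.6, "Global": 3.7, "Seasonal": 2.1, "MOGREPS": 1.3,
    "NAE": 1.3, "Other LAMs (e.g. Defence)": 0.9, "Wet Models": 0.2,
}

subtotal = sum(non_climate_production.values())
print(f"Non-Climate Production subtotal: {subtotal:.1f}%")                # 13.1%
print(f"NWP R&D : production ratio     : {37.3 / subtotal:.2f} : 1")      # ~2.85 : 1
print(f"Grand total                    : {49.6 + 37.3 + subtotal:.1f}%")  # 100.0%
```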

Operational Constraints

• Protecting the climate share when a Parallel Suite is also running
  • Half-cluster limit in total
• Delivery constraints
  • Many operational configurations => overlap of suites is forced => quarter-cluster limit per configuration
  • Customers require end products as soon as possible after data time => maximum run length ~60 minutes (for all except seasonal forecasts)


Daily Usage Profile: Production NWP

Optimisation successes

• HadGEM3 (GloSea4) expanded from the 38/42-level set to the 85/75-level set with no additional resource need – OpenMP plus improved coupling.
• Algorithmic changes to routines that impede scalability (QPOS, MSLP), reducing the need for global communications.
• The IO Server supports asynchronous STASH output, allowing the integration to proceed without interruption (see the sketch after this list).
• And many more…
  • 35 separate UM optimisations in 18 months
  • Improvements in other applications by as much as a factor of 10
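To make the asynchronous-output idea concrete, here is a minimal producer/consumer sketch in Python; the thread, file names, and toy model state are illustrative assumptions only – the actual UM IO Server runs as dedicated MPI tasks writing STASH/fieldsfile output, not as a Python thread.

```python
import queue
import threading

import numpy as np

# Diagnostic fields are handed to a writer via a queue, so the model
# integration never waits on the file system.

output_queue = queue.Queue()


def io_server():
    """Drain the queue and write each field, concurrently with the model."""
    while True:
        item = output_queue.get()
        if item is None:                        # sentinel: integration finished
            break
        step, field = item
        np.save(f"diag_{step:04d}.npy", field)  # stands in for STASH output


writer = threading.Thread(target=io_server)
writer.start()

state = np.zeros((96, 72))                      # toy model state (hypothetical)
for step in range(10):                          # toy timestep loop
    state = state + 1.0                         # stands in for dynamics/physics
    output_queue.put((step, state.copy()))      # hand off a copy; do not block

output_queue.put(None)                          # ask the writer to finish
writer.join()
```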


UM Scalability: Limited Area Models (UKV)

[Figure: UKV scaling curve; annotation: Production NWP]


UM Scalability: Global (25 km)

[Figure: global 25 km scaling curve; annotations: Production NWP – now, Production NWP – next]


UM Scalability: Impact of increased resolution


Ensemble DA

[Figure: elapsed time (s) vs cores (400–1600) for Perfect VAR, VAR N216, and EnDA N216 with 48 members]

Project Phase 2 system size

                      Met Office                              MONSooN
                      Phase 1         Phase 2        Factor   Phase 1   Phase 2
                      (IBM Power6)    (IBM Power7)
Number of Cores       6784            26624          3.9      960       2048
Peak TFlops           127             852            6.7      18        65
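The "Factor" column is the Phase 2 / Phase 1 ratio for the Met Office machine; the same arithmetic applied to the MONSooN columns (not quoted on the slide, computed here purely for illustration) gives roughly 2.1× the cores and 3.6× the peak:

```python
# Upgrade factors computed from the table above.  The Met Office "Factor"
# column is reproduced; the MONSooN factors are derived from its numbers.
systems = {
    "Met Office": {"cores": (6784, 26624), "peak_tflops": (127, 852)},
    "MONSooN":    {"cores": (960, 2048),   "peak_tflops": (18, 65)},
}

for name, metrics in systems.items():
    for metric, (phase1, phase2) in metrics.items():
        print(f"{name:10s} {metric:12s} factor = {phase2 / phase1:.1f}")
```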


Benchmark Expectations

Benchmark           UM Atm Climate N96   UM UK 1.5   UM Atm Global N512   4DVAR DA   NEMO Ocean
Copies / Cluster    64                   2           4                    4          16
Speedup (*)         4.3                  3.3         3.5                  2.4        2.6
Copies / Cluster    128                  4           8                    8          32
Speedup (*)         4.8                  3.9         4.0                  2.6        2.6

(*) Ratio of P6 wall-clock time to P7 wall-clock time. The contracted speedup measure for these benchmarks determines the system size.


Projected Scalability

[Figure: projected speedup vs cores]

Development Timetable

• April 2011: IBM factory testing complete
• June 2011: Test system installed
• Sep 2011: Main cluster ready to install codes
• Oct–Nov 2011: Acceptance testing and migration of production systems
• Dec 2011: Go live and prepare to decommission Power 6
• Jan 2012: Commission MONSooN upgrade


Tackling the scalability limit (in the short term)

The optimisation team (4 Met Office and 2 IBM analysts) has delivered >20% improvement during the past 18 months (10% = ~£1m pa).

Continued scope for optimisation in the short term:

• Incremental code improvements to improve both single-node performance and scalability.
• Next-generation systems (IBM Power7, Cray XE6, best-in-class Intel clusters…) deliver benefits, e.g. single-core design, interconnect latency/bandwidth, and better compiler / message-passing library / OpenMP support.
• Algorithmic changes, e.g. the ENDGAME polar grid changes, will improve solver efficiency.
• Weak scalability (i.e. increasing the problem size to enable effective use of more cores) gets us some of the way.

The UM might be able to make effective use of 10K cores for big problems.

Weak Scalability

• Core count chosen so as to deliver results in the same time.
• Weak scalability is good if the cost is unchanged (bar height 1).
• The time-step factor inhibits weak scalability: we cannot parallelise temporally (see the worked example after this list).
• Memory bandwidth inhibits the smallest N96 configuration (only 8 cores needed).
• Weak scalability is still not optimal even with the time-step factor removed.
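As a rough worked example of the time-step factor (a sketch assuming the usual CFL-type link between grid spacing and time step, which the slide does not state explicitly):

\[
\text{cost} \;\propto\; N_{\text{columns}} \times N_{\text{steps}},
\qquad
N_{\text{columns}} \propto \Delta x^{-2},
\qquad
N_{\text{steps}} \propto \Delta t^{-1} \propto \Delta x^{-1}.
\]

Halving \(\Delta x\) therefore multiplies the spatial work by 4 and the number of time steps by 2, i.e. roughly 8× the cost. The 4× spatial work can in principle be matched by 4× the cores (weak scaling), but the extra 2× from the shorter time step is sequential in time and cannot be parallelised, so the bar height rises above 1 even with perfect spatial weak scaling.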


NWP Plans

Mar 2012:
• ~40 km MOGREPS-G
• 12 km MOGREPS-R (with a new domain)
• Redesign of MOGREPS for 4 runs/day @ 12 members
• 1.5 km MOGREPS-UK (~4 members)
• Annual upgrade of global model physics

Jun 2012:
• Next Gen Global DA
• Trial NWP Nowcast (for London 2012)
• NAE replaced by MOGREPS-R

Nov 2012:
• Global horizontal resolution change (with new dynamical core?)


NWP Suite Design Flexibility

• Global Deterministic Suite (¼ cluster): target horizontal resolution dependent on ENDGAME.
• Global EPS (<¼ cluster): some flexibility with horizontal resolution; a 6-hour cycle @ 12 members for long runs spreads the peak load; some flexibility with the number of members running to T+6 for hybrid DA.
• Regional EPS (¼ cluster): initial data time offset by 3 hours, with a 6-hour cycle @ 12 members, to spread the peak load; resolution fixed at 12 km but some flexibility with domain size.
• UK EPS (¼ cluster): domain and resolution fixed; some flexibility with the number of members.

Beddington Review

• Supercomputer requirement driven by the need to answer key climate questions as soon as possible
• In the shorter term this requires:
  • Significant (4×) increase in computing
  • Further investment in collaboration infrastructure
  • Additional resources to improve model scalability
• For the longer term:
  • Proactive engagement to establish a European climate supercomputing infrastructure


Climate Plans

Climate model name   Atmosphere resolution            Ocean resolution    Moore's Law lead time (yrs)
HadGEM2-ES           135 km, 38 levels                72 km, 40 levels    0
HadGEM3-ES           135 km, 85 levels                72 km, 75 levels    4
HadGEM3              135 km, 200 levels               17 km, 75 levels    11
HadGEM3              60 km, 85 levels                 17 km, 75 levels    7
UK HadGAM3           1.5 km, 70 levels (UK domain)    —                   12
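One way to read the "Moore's Law lead time" column (an assumption; the slide does not define the term) is the number of years of exponential capacity growth needed before a configuration becomes affordable within today's allocation. With a capacity doubling time of roughly \(T \approx 1.5\) years and a configuration costing a factor \(C\) more than can be run today:

\[
t \;\approx\; T \log_2 C
\]

For example, a configuration needing roughly 160× today's capacity would give \(t \approx 1.5 \times \log_2 160 \approx 11\) years.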

• Strategy supported by the Government Chief Scientist
• A 4× increase in capacity relative to plan would be ideal
• A compromise bid for a 50% increase would provide significant extra benefit in terms of early answers to key questions


Summary

• Significant improvements in UM scalability following optimisations on Power 6
• Power 7, plus further optimisations, holds out the prospect of a further step change in scalability
• This allows an aggressive plan for model/ensemble upgrades
• The case for a significant increase in climate supercomputing has been made as part of the Beddington review


Questions, answers & discussion
