www.ccsm.ucar.edu: Running CCSM. Tony Craig, CCSM Software Engineering Group, ccsm@ucar.edu


Page 1: Running CCSM

Tony Craig, CCSM Software Engineering Group

ccsm@ucar.edu

Page 2: Outline

• General review of CCSM

• Setting up and running a simple case

• Datasets

• Production

• Modifying source code

• Errors

• Tools

• Performance

Page 3: Review of CCSM

• Five components / ten models
  – Atmosphere (3): atm, datm, latm
  – Ocean (2): ocn, docn
  – Land (2): lnd, dlnd
  – Ice (2+): ice, ice (prescribed mode), ice (mixed layer ocean mode), dice
  – Coupler (1): cpl
• Communication via MPI between components and coupler only
• Each component runs on multiple processors via MPI, OpenMP, or hybrid MPI/OpenMP

Page 4: Component parallelization

• atm : MPI, OpenMP, or MPI/OpenMP
• lnd : MPI, OpenMP, or MPI/OpenMP
• ice : MPI only
• ocn : MPI only
• cpl : OpenMP only
• The data models (datm, docn, dice, dlnd, and latm) : serial only, 1 processor

Page 5: Configurations

• A = datm, dlnd, docn, dice, cpl
• B = atm, lnd, ocn, ice, cpl
• C = datm, dlnd, ocn, dice, cpl
• D = datm, dlnd, docn, ice, cpl
• F = atm, lnd, docn, ice (prescribed mode), cpl
• G = latm, dlnd, ocn, ice, cpl
• H = atm, dlnd, docn, dice, cpl
• I = datm, lnd, docn, dice, cpl
• K = atm, lnd, docn, dice, cpl
• M = latm, dlnd, docn, ice (ml ocn mode), cpl
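
The configuration letters above amount to a lookup table. A minimal sketch in Python, assuming the component order atm, lnd, ocn, ice, cpl; the names `CONFIGURATIONS`, `DATA_MODELS`, and `active_components` are illustrative and are not part of the CCSM scripts:

```python
# Configuration letter -> component setups, transcribed from the slide.
# Tuple order is (atm, lnd, ocn, ice, cpl); this ordering is an assumption.
CONFIGURATIONS = {
    "A": ("datm", "dlnd", "docn", "dice", "cpl"),
    "B": ("atm", "lnd", "ocn", "ice", "cpl"),
    "C": ("datm", "dlnd", "ocn", "dice", "cpl"),
    "D": ("datm", "dlnd", "docn", "ice", "cpl"),
    "F": ("atm", "lnd", "docn", "ice (prescribed mode)", "cpl"),
    "G": ("latm", "dlnd", "ocn", "ice", "cpl"),
    "H": ("atm", "dlnd", "docn", "dice", "cpl"),
    "I": ("datm", "lnd", "docn", "dice", "cpl"),
    "K": ("atm", "lnd", "docn", "dice", "cpl"),
    "M": ("latm", "dlnd", "docn", "ice (ml ocn mode)", "cpl"),
}

# The data models named on the component parallelization slide.
DATA_MODELS = {"datm", "dlnd", "docn", "dice", "latm"}


def active_components(config):
    """Return the setups in a configuration that are neither data models nor the coupler."""
    return [s for s in CONFIGURATIONS[config]
            if s not in DATA_MODELS and s != "cpl"]


if __name__ == "__main__":
    for letter in sorted(CONFIGURATIONS):
        print(letter, active_components(letter))
```

Listing the active components this way makes it easy to see which configurations exercise a single full model (C, H, I) and which couple several.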

Page 6: Resolutions

• atm/lnd/datm/dlnd = T42, T31
• ocn/ice/docn/dice = gx1v3, gx3, gx3v4
• latm = T62
• Scientifically validated combinations
  – B, T42_gx1v3 = b20.007 control run (test.a1 case)
  – B, T31_gx3v4 = paleo control run (test.a2 case)

Page 7: “Available” configurations

Configuration columns: A B C D F G H I K M

T42_gx1v3 : * * * * * * * *
T31_gx3   : * * * * * * *
T31_gx3v4 : *
T62_gx1v3 : * *
T62_gx3   : * *

* = supported (subject to change). The slide marks the b20.007 control and the paleo control with separate markers.

Page 8: Platforms

• IBM

• SGI

• Compaq*

Page 9: Review of scripts

• Main script (test.a1.run)
  – Sets primary ccsm environment variables
  – Calls $model.setup.csh
    • Gets input datasets
    • Builds components
  – Runs model
  – Archives
  – Harvests

Page 10: Setting up a simple case

• Use the GUI!
  – The GUI modifies the scripts and creates a new case for you
  – Input $CASE, $CSMROOT, $CSMDATA, $EXEROOT
  – Input resolution
  – Input configuration (A-M)
  – Sets processor layout based on configuration (first guess)
  – Sets some batch environment variables
  – Works well in the NCAR environment; other sites require tuning after the scripts are generated

Page 11: Setting up a simple case, without GUI

• Create new case directory under scripts, copy over test.a1 files
• Rename file test.a1.run to $CASE.run
  – Edit $CASE, $CSMROOT, $CSMDATA, $EXEROOT, $ARCROOT
  – Edit batch environment parameters
  – Edit $GRID
  – Edit $SETUPS
  – Edit $NTASKS, $NTHRDS

Page 12: $NTASKS, $NTHRDS, batch

• $NTASKS is the number of MPI tasks for each component
• $NTHRDS is the number of OpenMP threads per MPI task
• $NTASKS*$NTHRDS = total number of processors for each component
• Tuning is required to get optimal load balance
• Batch parameters should match the processors used; consistency is important, and task_geometry (LoadLeveler) is very powerful

Page 13: Component parallelization

• atm : MPI, OpenMP, or MPI/OpenMP
• lnd : MPI, OpenMP, or MPI/OpenMP
• ice : MPI only, NTHRDS=1
• ocn : MPI only, NTHRDS=1
• cpl : OpenMP only, NTASKS=1
• The data models (datm, docn, dice, dlnd, and latm) : serial only, 1 processor, NTASKS=1, NTHRDS=1
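
The constraints above can be checked mechanically before a job is submitted. The following is a rough sketch, assuming the setup names from the slides; `PARALLELIZATION` and `check_layout` are hypothetical helpers, not CCSM tools:

```python
# Parallelization limits per setup, taken from the slide.
# "mpi": NTASKS may exceed 1.  "omp": NTHRDS may exceed 1.
PARALLELIZATION = {
    "atm": {"mpi": True, "omp": True},
    "lnd": {"mpi": True, "omp": True},
    "ice": {"mpi": True, "omp": False},
    "ocn": {"mpi": True, "omp": False},
    "cpl": {"mpi": False, "omp": True},
}
# Data models are serial: one task, one thread.
for _name in ("datm", "dlnd", "docn", "dice", "latm"):
    PARALLELIZATION[_name] = {"mpi": False, "omp": False}


def check_layout(setup, ntasks, nthrds):
    """Return a list of problems with one component's task/thread counts."""
    rules = PARALLELIZATION[setup]
    problems = []
    if ntasks < 1 or nthrds < 1:
        problems.append(f"{setup}: NTASKS and NTHRDS must be at least 1")
    if ntasks > 1 and not rules["mpi"]:
        problems.append(f"{setup}: NTASKS must be 1")
    if nthrds > 1 and not rules["omp"]:
        problems.append(f"{setup}: NTHRDS must be 1")
    return problems


if __name__ == "__main__":
    print(check_layout("ocn", 40, 1))   # valid layout, no problems
    print(check_layout("cpl", 2, 4))    # coupler cannot use more than one MPI task
```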

Page 14: Main script configuration summary

• B case

MODELS ( atm   lnd   ocn   ice   cpl )
SETUPS ( atm   lnd   ocn   ice   cpl )
NTASKS ( 8     2     40    8     1   )
NTHRDS ( 4     4     1     1     4   )

• datm/dlnd/ocn/ice case

MODELS ( atm   lnd   ocn   ice   cpl )
SETUPS ( datm  dlnd  ocn   ice   cpl )
NTASKS ( 1     1     64    16    1   )
NTHRDS ( 1     1     1     1     4   )
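
As a worked example of the rule that $NTASKS*$NTHRDS gives the processors per component, the two layouts above can be totalled as follows. This is an illustrative sketch; `processors_per_component` is a hypothetical helper and only the numbers come from the slide:

```python
MODELS = ("atm", "lnd", "ocn", "ice", "cpl")


def processors_per_component(models, ntasks, nthrds):
    """Map each model to NTASKS * NTHRDS, the processors it occupies."""
    return {m: t * h for m, t, h in zip(models, ntasks, nthrds)}


# B case from the slide.
b_case = processors_per_component(MODELS, (8, 2, 40, 8, 1), (4, 4, 1, 1, 4))
# datm/dlnd/ocn/ice case from the slide.
data_case = processors_per_component(MODELS, (1, 1, 64, 16, 1), (1, 1, 1, 1, 4))

if __name__ == "__main__":
    print(b_case, "total:", sum(b_case.values()))
    print(data_case, "total:", sum(data_case.values()))
```

The totals are what the batch request has to match: 92 processors for the B case and 86 for the datm/dlnd/ocn/ice case.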

Page 15: $RUNTYPE

• Startup - initial startup of model using arbitrary initialization
  – set $CASE, $BASEDATE
• Continue - continuation of case, bit-for-bit guaranteed, uses model restart files
  – set $CASE
• Branch - start new case as a bit-for-bit continuation of another case, uses model restart files, requires continuous date
  – set $CASE, $REFCASE, $REFDATE
• Hybrid - start new case, not bit-for-bit continuation, uses model initial files in atm and land, can change starting date
  – set $CASE, $BASEDATE, $REFCASE, $REFDATE
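
The variables each run type needs can be summarized as a small table. A sketch under stated assumptions: the lowercase run-type strings and the helper `missing_vars` are illustrative, and the main script may spell or check these differently:

```python
# Variables that must be set for each $RUNTYPE, as listed on the slide.
REQUIRED_VARS = {
    "startup": ("CASE", "BASEDATE"),
    "continue": ("CASE",),
    "branch": ("CASE", "REFCASE", "REFDATE"),
    "hybrid": ("CASE", "BASEDATE", "REFCASE", "REFDATE"),
}


def missing_vars(runtype, env):
    """Return the required variables that are unset or empty in env."""
    return [v for v in REQUIRED_VARS[runtype.lower()] if not env.get(v)]


if __name__ == "__main__":
    # A branch run with only $CASE set is missing its reference case and date.
    print(missing_vars("branch", {"CASE": "test.a1"}))
```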

Page 16: Coupler namelist

• stop_option : ndays, nmonths, newmonth, halfyear, newyear, newdecade
• stop_n : integer (ndays, nmonths)
• rest_freq : ndays, monthly, quarterly, halfyear, yearly
• rest_n : integer (ndays)
• diag_freq : daily, weekly, biweekly, monthly, quarterly, yearly, ndays
• diag_n : integer (ndays)
• info_bcheck : integer
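
One way to catch typos in these settings before a run is to validate them against the allowed values listed above. The sketch below assumes lowercase variable names and that the `_n` counters are needed only for the options shown in parentheses on the slide; `check_namelist` is a hypothetical helper, not part of cpl.setup.csh:

```python
# Allowed values for the frequency-style coupler namelist variables (from the slide).
ALLOWED = {
    "stop_option": {"ndays", "nmonths", "newmonth", "halfyear", "newyear", "newdecade"},
    "rest_freq": {"ndays", "monthly", "quarterly", "halfyear", "yearly"},
    "diag_freq": {"daily", "weekly", "biweekly", "monthly", "quarterly", "yearly", "ndays"},
}

# For each variable: its companion counter and the values that need the counter.
NEEDS_COUNT = {
    "stop_option": ("stop_n", {"ndays", "nmonths"}),
    "rest_freq": ("rest_n", {"ndays"}),
    "diag_freq": ("diag_n", {"ndays"}),
}


def check_namelist(settings):
    """Return a list of problems found in a dict of coupler namelist settings."""
    problems = []
    for key, allowed in ALLOWED.items():
        value = settings.get(key)
        if value is None:
            continue
        if value not in allowed:
            problems.append(f"{key}: unknown value {value!r}")
            continue
        count_key, counted = NEEDS_COUNT[key]
        if value in counted and not isinstance(settings.get(count_key), int):
            problems.append(f"{count_key}: integer required when {key} is {value!r}")
    return problems


if __name__ == "__main__":
    print(check_namelist({"stop_option": "ndays", "stop_n": 5, "rest_freq": "monthly"}))
    print(check_namelist({"stop_option": "ndays"}))
```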

Page 17: Data Sets

• Types
  – Grid files, binary
  – Namelist input, ascii
  – Initial datasets, binary/netcdf
  – Restart datasets, binary
  – History datasets, netcdf
  – Log files, ascii
• inputdata directory
  – This is usually pointed to by $CSMDATA

Page 18: Data Flow, Input

• Everything is copied to $EXEROOT
• Tools and scripts attempt to automate most of the “get input files” step
• Main script variables include $CSMDATA, $LFSINP, $LMSINP, $MACINP, $RFSINP, $RMSINP

[Diagram labels: $EXEROOT; Mass Store; $ARCROOT/restart; $CSMDATA = inputdata; scripts/$CASE; Setup scripts]

Page 19: Data Flow, Output

• Output files are moved out of $EXEROOT
• Harvesting is a separate process
• Writing of restart files is coordinated by the coupler
• Writing of history files is not coordinated between components; monthly average is the default
• Main script variables include $LMSOUT, $MACOUT, $RFSOUT

[Diagram labels: $EXEROOT; $ARCROOT; Mass Store; Scripts; archiving; harvesting]

Page 20: Log Files

• Each component produces a log file, $model.log.$LID
• $LID is a system date stamp
• Date stamps are the same on all log files for a run
• Log files are written into the $EXEROOT/$model directories during execution
• Log files are copied to $SCRIPTS/logs at the end of a run
• There are separate stdout and stderr files that sometimes contain output information

Page 21: Archiving, ccsm_archive

• Archiving means moving model output to a separate area on a local disk; this is done by ccsm_archive
• Local disk area is set by $ARCROOT in the main script
• Benefits
  – Allows separation of running and harvesting
  – Mass storage availability does not prevent continued execution of the model
  – Allows users to run in volatile temporary space
  – Supports simple harvesting in a clustered machine environment (like nirvana)

Page 22: Harvesting, $CASE.har

• Harvesting means copying model output to the local mass store
• Separate script in scripts/$CASE, $CASE.har
• Typically submitted in batch, can also be run interactively
• Submitted by main script after model run, off by default
• Sources ccsm_joe for important environment variables
• Harvests all files in $ARCROOT/{atm,lnd,ocn,ice,cpl}
• Verifies accurate copy on mass store before removing the local file
• Can scp files to remote machines
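
The "verify before removing" step is the part of harvesting that protects against data loss. A minimal sketch of that pattern, assuming a plain directory stands in for the mass store and a SHA-256 comparison stands in for whatever check $CASE.har actually performs; `harvest_file` is a hypothetical name:

```python
import hashlib
import shutil
from pathlib import Path


def _digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def harvest_file(src, dest_dir):
    """Copy src into dest_dir, verify the copy, and only then remove src.

    Returns True if the file was harvested, False if verification failed
    (in which case the source file is left in place).
    """
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copy2(src, dest)
    if _digest(src) != _digest(dest):
        return False
    src.unlink()
    return True
```

Leaving the source untouched on a failed check means a later harvest can simply retry.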

Page 23: Exact Restart

• CCSM can stop and restart exactly

• The coupler controls the frequency of restart file writes

• Restart files guarantee bit-for-bit continuity at a checkpoint boundary

• rpointer files are updated in the scripts/$CASE directory after each run

Page 24: Restart file management (1)

• ccsm_archive
  – In scripts/$CASE
  – Called from main script after model run is complete, commented out by default
  – $ARCROOT/restart contains the latest full set of restart files
  – ccsm_archive copies full set of restart datasets into $ARCROOT/restart after each run
  – ccsm_archive then tars up that restart set into the $ARCROOT/restart.tars directory
  – These tar files can be large, regular clean up required

Page 25: Restart file management (2)

• ccsm_getrestart
  – In scripts/tools
  – Called from main script before model run starts, commented out by default
  – Copies the latest set of restart files from $ARCROOT/restart to the appropriate directories
• To back up a model run to a previous model date
  – Assumes both ccsm_archive and ccsm_getrestart have been active in the main script
  – Delete all files in $ARCROOT/restart
  – Untar an $ARCROOT/restart.tars file into $ARCROOT/restart
  – Resubmit
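
The delete-then-untar steps for backing up a run can be sketched as below. This is not the CCSM tooling: `restore_restart_set` is a hypothetical helper, and it assumes the tar file stores the restart files with flat names so that they land directly in $ARCROOT/restart:

```python
import tarfile
from pathlib import Path


def restore_restart_set(arcroot, tar_name):
    """Replace the contents of <arcroot>/restart with one archived restart set.

    tar_name is a file in <arcroot>/restart.tars. Returns the sorted names
    of the entries in <arcroot>/restart after the restore.
    """
    restart_dir = Path(arcroot) / "restart"
    tar_path = Path(arcroot) / "restart.tars" / tar_name
    # Check the archive exists before deleting anything.
    if not tar_path.is_file():
        raise FileNotFoundError(tar_path)
    restart_dir.mkdir(exist_ok=True)
    # Step 1: delete all files in $ARCROOT/restart.
    for entry in restart_dir.iterdir():
        if entry.is_file():
            entry.unlink()
    # Step 2: untar the chosen restart set into $ARCROOT/restart.
    with tarfile.open(tar_path) as tar:
        tar.extractall(restart_dir)
    return sorted(p.name for p in restart_dir.iterdir())
```

After the restore, resubmitting the main script (with ccsm_getrestart active) picks up the older restart set.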

Page 26: Auto-Resubmit

• RESUBMIT file in scripts/$CASE directory
  – Contains a single integer
  – If the integer is >0, main script resubmits itself and decrements the integer
• Runaway jobs
  – FIRST! Set value in RESUBMIT file to 0
  – Then attempt to kill running jobs
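
The RESUBMIT counter logic described above is simple enough to sketch directly. The functions below are illustrative Python, not the csh code in the main script; `should_resubmit` and `stop_resubmission` are hypothetical names, and treating an empty or missing file as 0 is an assumption:

```python
from pathlib import Path


def should_resubmit(case_dir):
    """Decrement the RESUBMIT counter and report whether to resubmit.

    Returns True (and writes back count - 1) if the counter is > 0.
    A missing or empty file is treated as 0.
    """
    resubmit = Path(case_dir) / "RESUBMIT"
    if not resubmit.is_file():
        return False
    text = resubmit.read_text().strip()
    count = int(text) if text else 0
    if count <= 0:
        return False
    resubmit.write_text(f"{count - 1}\n")
    return True


def stop_resubmission(case_dir):
    """First step for a runaway job: set the counter to 0 so nothing new is queued."""
    (Path(case_dir) / "RESUBMIT").write_text("0\n")
```

Zeroing the file before killing jobs matters because a job killed while the counter is still positive may already have queued its successor.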

Page 27: Production

• Modify coupler namelist in cpl.setup.csh: set run length and restart frequency, turn down diagnostic frequency, set info_bcheck to 0
• Run a startup, hybrid, or branch case $RUNTYPE
• Transition to continue $RUNTYPE
• Turn on archiving, harvesting, and ccsm_getrestart
• Edit RESUBMIT file to initiate auto-resubmission

Page 28: Monitoring a run

• Monitor the batch jobs using llq, bjobs, qstat
• Verify that runs complete successfully, check for timing information at the end of a log file
• tail -f $EXEROOT/cpl/cpl.log*
• If runs are not succeeding,
  – tail each log file
  – grep for ENDRUN in atm and lnd log files
  – Check stdout and stderr files for component messages or system messages
  – Look for core files in $EXEROOT/$model
  – Look for zero length files in $EXEROOT/$model
  – Check email
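
Several of the checks above (core files, zero-length files, ENDRUN in the atm and lnd logs) can be automated. A rough sketch, assuming core files have names starting with "core" and log files follow the $model.log.$LID pattern; `diagnose` is a hypothetical helper:

```python
from pathlib import Path


def diagnose(exeroot, models=("atm", "lnd", "ocn", "ice", "cpl")):
    """Scan <exeroot>/<model> directories and return a list of findings."""
    findings = []
    for model in models:
        model_dir = Path(exeroot) / model
        if not model_dir.is_dir():
            continue
        for path in sorted(model_dir.iterdir()):
            if not path.is_file():
                continue
            if path.name.startswith("core"):
                # Assumed naming convention for core dumps.
                findings.append(f"{model}: core file {path.name}")
            elif path.stat().st_size == 0:
                findings.append(f"{model}: zero-length file {path.name}")
            elif model in ("atm", "lnd") and ".log." in path.name:
                # ENDRUN in an atm or lnd log marks a model-detected failure.
                if "ENDRUN" in path.read_text(errors="replace"):
                    findings.append(f"{model}: ENDRUN found in {path.name}")
    return findings
```

An empty result does not prove the run succeeded; the stdout and stderr files and the timing information at the end of the logs still need a look.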

Page 29: Modifying source code

• Modifying files in the ccsm models directory is not recommended
• Create directories under scripts/$CASE
  – src.atm, src.lnd, src.ocn, src.ice, src.cpl
  – Copy subset of model source code to these directories and modify it
  – Files in these directories have the highest priority in the build
• Benefits include
  – Release source code remains unmodified and available
  – Allows implementation of case dependent code modifications
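
The "highest priority" rule means a file in src.$model shadows the release copy of the same name. The sketch below illustrates that lookup order only; it is not how the CCSM build is implemented, and `resolve_source` and its `release_dirs` argument are hypothetical:

```python
from pathlib import Path


def resolve_source(filename, case_dir, model, release_dirs):
    """Return the path the build should use for filename, or None.

    The case-specific directory <case_dir>/src.<model> is searched first,
    then the release source directories in the order given.
    """
    override = Path(case_dir) / f"src.{model}" / filename
    if override.is_file():
        return override
    for directory in release_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None
```

Because the release tree is only read, deleting a file from src.$model restores the original behavior on the next clean build.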

Page 30: Multiple Machine Support

• Should run on blackforest, babyblue, and ute “out of the box”

• “Other” machines include seaborg, nirvana, eagle, falcon, cheetah

• Supported platforms are indicated in $OS, $SITE, $MACH, $ARCH environment variables in the main script

• See also scripts/tools/test.a1.mods.$MACH for suggested changes to test.a1.run for “other” machines.

Page 31: Running on a “New” Machine

• Main script
  – Set batch queue commands
  – Add new $OS, $SITE, $MACH, $ARCH options
  – Set standard CCSM path names, $CSMROOT, …
  – Harvester submission issues
  – Set data movement variables, $LMSINP, …
• Harvester script
  – May require modification
• Tools
  – May need to modify ccsm_msread, ccsm_mswrite
• Build
  – Modify models/bld/Macros.$OS file

Page 32: ccsm_joe

• Created by main script

• Updated every time the main script runs

• Case dependent

• Records important ccsm environment variables

• Can be “sourced” by other scripts to inherit ccsm environment variables

Page 33: Interactive/Batch Issues

• Can run main script interactively
• Typically used to build and pre-stage initial data
• Uncomment the “exit” command in the main script to stop it before ccsm execution starts
• Batch environment highly site dependent
  – NQS
  – LoadLeveler
  – LSF
  – PBS

Page 34: Common Errors (1)

• Model won’t build
  – Try rebuilding clean
  – Remove all obj directories; these are $OBJROOT/model/obj, which is normally equivalent to $EXEROOT/model/obj
  – When rebuilding, make sure $SETBLD is true in main script
• Model won’t continue due to restart problem
  – Determine cause of problem: quota, hardware, script, zero length files, rpointer problems
  – Fix if possible
  – Back up to latest “good” restart dataset
  – Rerun

Page 35: Common Errors (2)

• Ice model stops due to mp transport error
  – Double ndte in ice.setup.csh ice model namelist
  – Back up to latest “good” restart dataset
  – Run past previous stop date
  – Reset ndte value
• Ocean model non-convergence
  – Add about 10% to the number of model timesteps/hour in ocn.setup.csh, DT_COUNT
  – Back up to latest “good” restart dataset
  – Run past previous stop date
  – Reset DT_COUNT
  – Non-convergence on first timestep is special case
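
The two temporary parameter changes above are simple arithmetic. A small sketch, assuming both values are positive integers and that "about 10%" is rounded up to the next whole step; `bump_dt_count` and `double_ndte` are hypothetical helpers, and the sample values are illustrative rather than CCSM defaults:

```python
def bump_dt_count(dt_count, percent=10):
    """Return DT_COUNT increased by roughly `percent`, rounded up to an integer."""
    extra = -(-dt_count * percent // 100)  # ceiling division using integers only
    return dt_count + extra


def double_ndte(ndte):
    """Return the doubled ndte value used to get past an ice transport error."""
    return 2 * ndte


if __name__ == "__main__":
    print(bump_dt_count(23))   # illustrative value only
    print(double_ndte(120))    # illustrative value only
```

Both changes are meant to be temporary: once the run is past the previous stop date, the original values go back in.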

Page 36: Tools

• Under scripts/tools
  – ccsm_getfile : hierarchical search for file
  – ccsm_getinput : hierarchical search for input file
  – ccsm_msread : copies a file from local mass store
  – ccsm_mswrite : copies a file to local mass store
  – ccsm_checkenvs : echoes ccsm environment variables, used to create ccsm_joe
  – ccsm_getrestart : copies restart files from $ARCROOT/restart to appropriate $EXEROOT and scripts/$CASE directories

Page 37: Performance

• This is complicated!
• Issues
  – Performance of components and system as a function of resolution and configuration
  – Scalability of individual components, scaling efficiency of individual components
  – Task/Thread counts
  – Components sharing nodes, overloading nodes with multiple components, overloading threads, overloading tasks
  – Load balance of coupled system

Page 38: Component Timings

[Figure: seconds per simulated day (axis 0 to 300) versus number of processors (4, 8, 16, 32, 64) for atm, lnd, ice, and ocn.]

Page 39: CCSM Load Balancing

[Figure: processor allocation and timings in seconds per day. Processors: 40 ocean, 32 atm, 16 ice, 12 land, 4 cpl, 104 total. Timing values shown in the diagram: 9.4, 3.0, 6.2, 15.0, 8.6, 40.4, 53.2, 10.0, 10.0, 55, 3, 2, 5.]

Page 40: Component/Hardware layout

• Machine, set of nodes
• Nodes, group of processors that share memory
• Processors, individual computing elements
• General rules
  – Do not oversubscribe processors, place only 1 MPI task or 1 thread on each processor
  – Minimize the number of nodes used for a given component and processor requirement
  – Multiple components can share a node as long as there is no oversubscription of processors
  – Test several decompositions, layouts, task/thread combinations to try to optimize performance
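
The first two rules can be expressed as quick checks when planning a layout. A sketch under stated assumptions: every node has the same number of processors, and all threads of one MPI task must sit on a single shared-memory node; `nodes_needed` and `is_oversubscribed` are hypothetical helpers:

```python
def nodes_needed(ntasks, nthrds, procs_per_node):
    """Return the minimum number of nodes for one component.

    Each MPI task occupies nthrds processors on a single node, so a node
    holds procs_per_node // nthrds tasks.
    """
    if nthrds > procs_per_node:
        raise ValueError("threads of one MPI task must fit on a single node")
    tasks_per_node = procs_per_node // nthrds
    return -(-ntasks // tasks_per_node)  # ceiling division


def is_oversubscribed(layout, total_procs):
    """layout maps component -> (ntasks, nthrds); True if it needs more than total_procs."""
    return sum(t * h for t, h in layout.values()) > total_procs


if __name__ == "__main__":
    # Illustrative numbers: 8 tasks x 4 threads on 4-processor nodes.
    print(nodes_needed(8, 4, 4))
    print(is_oversubscribed({"atm": (8, 4), "ocn": (40, 1)}, 64))
```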

Page 41: Summary

• CCSM is a complicated multi-executable climate model; expect there to be “spin-up” time
• CCSM is a scientific research code
• There are many possible components, configurations, platforms, and resolutions; we are unable to test everything
• Users are responsible for validating their science
• NCAR can help with software/configuration problems, ccsm@ucar.edu
• Please report bugs, fixes, improvements, and ports to new hardware, so we can incorporate those changes! ccsm@ucar.edu