Post on 02-Jan-2016
13 May 2014 : Supervision Workshop
Morning : outgoing messages
• IPSL : general introduction : supervision of the computing and post-processing chain at TGCC and IDRIS
• IPSL : needs and draft solutions for sending information messages from the computing centres to IPSL
• IDRIS : existing tools and constraints
• TGCC : existing tools and constraints
• Discussion
• Action plan
Afternoon : incoming jobs
• IPSL : needs for relaunching computing and/or post-processing jobs from IPSL
• IDRIS : job submission : which solutions? which APIs? which constraints?
• TGCC : job submission : which solutions? which APIs? which constraints?
• Discussion
• Action plan
T0 : management
T1 : platform
• ensemble of tools
• different configurations
• different resolutions
• set of simulations
• set of diagnostics
• assessment
T2 : towards a high-resolution coupled model
• Improving coupled model parallelism in terms of computing and memory
• Managing efficiently input and restart files
• Integrating parallel interpolation mechanisms in XIOS
• Parallel component coupling
T3 : runtime environments
• Process assignment
• Optimization, load balancing
• Climate Simulations Supervision
T4 : Big Data management and analytics of Climate Simulations
• XIOS implemented within project models
• XIOS a bridge towards standardisation
• Data and metadata services
• Big Data Analytics
T5 : CliMAF: a framework for climate models evaluation and analysis
• General driver and upstream user interface
• Services layer
• Visualization tools
• Evaluation and monitoring diagnostics
IPSL implementation
GAME-CERFACS implementation
ANR MN2013 CONVERGENCE
Task 3 : Runtime Environment
Leaders : Arnaud Caubel and Marie-Alice Foujols
Contributors : IPSL, CERFACS, IDRIS, CNRS-GAME
Help and expertise : TGCC, MDLS
Task | IPSL CERFACS CNRS-GAME IDRIS TGCC MDLS
3.1 Process assignment | X X X
3.2 Optimisation, load balancing | X X X
3.3 Climate Simulations Supervision | X X X
Task 3.3 : Climate Simulations Supervision
[Figure labels : Launch a simulation with libIGCM ; IDRIS or TGCC ; Computing ; Post-processing ; Supervisor ; User ; Jussieu? ; Commands]
Objective : libIGCM self-healing application : more reliability, less human intervention
Task 3.3 : Climate Simulations Supervision
Context
– one simulation
• 3 weeks running, 100 000 files, 25 TB, 1000 jobs : 40 computing and post-processing
– static workflow vs dynamic workflow
Development of a supervisor agent
– detect and understand failure events
– understand the ultimate goals of the workflow
– re-plan, re-schedule, re-map the workflow
Tasks for the supervisor
– events logged in a comprehensive call tree (job submission, work to be done, each cp, ...)
– reliable lightweight communication channel between client agents and server agents (RabbitMQ implementation of AMQP)
– call tree traversal capabilities to determine the checkpoint to restart from
– autonomous rescheduling of necessary jobs
– monitoring capabilities : coloured graphs with all jobs and their status
– regression test handling capabilities
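The call-tree idea can be pictured with a minimal Python sketch. Everything in it is hypothetical and for illustration only: the `Event` class, its fields, the event names and both helper functions are assumptions, not part of libIGCM or of the supervisor design. It records events in a tree, finds the first failed event by depth-first traversal, and reports the last top-level step that completed before it, which is where a restart could begin.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Event:
    """One node of the events call tree (hypothetical structure)."""
    name: str
    status: str = "ok"  # "ok" or "failed"
    children: List["Event"] = field(default_factory=list)

    def add(self, child: "Event") -> "Event":
        self.children.append(child)
        return child


def find_first_failure(node: Event, path: Tuple[str, ...] = ()) -> Optional[Tuple[str, ...]]:
    """Depth-first, left-to-right search for the first failed event.

    Returns the names from the root down to that event, or None if none failed.
    """
    here = path + (node.name,)
    for child in node.children:
        found = find_first_failure(child, here)
        if found is not None:
            return found
    return here if node.status == "failed" else None


def last_good_step(root: Event) -> Optional[str]:
    """Name of the last top-level step that completed before the first failure.

    Returns None if the very first step failed, and the last step if nothing failed.
    """
    previous = None
    for child in root.children:
        if find_first_failure(child) is not None:
            return previous
        previous = child.name
    return previous


# Illustrative tree: two periods, the second one fails while copying outputs.
simulation = Event("EXP00")
p1 = simulation.add(Event("period_1"))
p1.add(Event("job_submission"))
p1.add(Event("copy_outputs"))
p2 = simulation.add(Event("period_2"))
p2.add(Event("job_submission"))
p2.add(Event("copy_outputs", status="failed"))

print(find_first_failure(simulation))  # ('EXP00', 'period_2', 'copy_outputs')
print(last_good_step(simulation))      # period_1
```

A real supervisor would fill such a tree from messages received over the communication channel; here the tree is built by hand to keep the sketch self-contained.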
Task 3.3 : Climate Simulations Supervision
Milestone | Date | Description | IPSL CERFACS CNRS-GAME IDRIS TGCC MDLS
MS3.3a | M12 : 10/2014 | Supervisor agent architectural design | X X X x
MS3.3b | M24 : 10/2015 | Supervisor agent release candidate : enabling control channel, full events logs, call tree traversal capabilities and regression test handling | X X X x
MS3.3c | M48 : 10/2017 | Supervisor agent final release : successful rescheduling for known failures | X X X x
Task 3.3 : Climate Simulations Supervision
Additional manpower :
• Fixed-term contract (CDD) 21 pm IPSL (tasks 3.1, 3.2, 3.3) + CDD/IDRIS 6 pm
• Subcontractor IPSL 42 pm (task 3.3)
• TGCC/CEA : service contract ?
Success criteria :
• A significant number of “standard” (i.e. non-expert) Earth System Model users launch typical climate simulations (including the developments made in this WP) using the libIGCM runtime environment on HPC centres (IDRIS and TGCC)
Identified risks :
• if it is not possible to install the supervisor agent : lighter installation with warnings instead of corrections
• the supervisor must be as transparent as possible : lighter usage, i.e. de/activation of main tasks/secondary tasks
Planning for next 6 months :
• Meeting/workshop to be planned to discuss “Supervisor Design” (task 3.3)
[Figure labels : Computing job (PeriodLength, PeriodNb) ; Post-processing jobs (RebuildFrequency, PackFrequency, SeasonalFrequency, TimeSeriesFrequency)]
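As a rough illustration of how these frequency parameters could drive post-processing, the Python sketch below assumes that a post-processing job becomes due whenever the simulated time is a whole multiple of its frequency. That trigger rule, the function name and the values expressed in months are assumptions chosen for illustration; libIGCM's actual scheduling logic is not reproduced here.

```python
# Frequencies expressed in months (illustrative values, not a real configuration).
FREQUENCIES = {
    "rebuild": 12,      # e.g. RebuildFrequency=1Y
    "pack": 120,        # e.g. PackFrequency=10Y
    "create_ts": 120,   # e.g. TimeSeriesFrequency=10Y
    "create_se": 600,   # e.g. SeasonalFrequency=50Y
}


def post_jobs_due(months_done, frequencies=FREQUENCIES):
    """Return the post-processing jobs due once `months_done` months are simulated.

    Assumed rule: a job is due when months_done is a multiple of its frequency.
    """
    if months_done <= 0:
        return []
    return [job for job, every in frequencies.items() if months_done % every == 0]


print(post_jobs_due(7))    # []
print(post_jobs_due(12))   # ['rebuild']
print(post_jobs_due(120))  # ['rebuild', 'pack', 'create_ts']
```

Under this simplified rule every frequency must be a multiple of the computing period length for its jobs ever to become due.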
TGCC computers and file system in a nutshell
[Figure, October 2013]
Computers :
• curie : front-end ; hybrid nodes (-q hybrid) ; thin nodes (-q standard) ; large nodes (-q xlarge)
• airain : front-end ; nodes
File system :
• $HOME ; $CCCWORKDIR ; $SCRATCHDIR ; $CCCSTOREDIR
• HPSS : robotic tapes
• dods/work ; dods/store
Contents :
• sources, small results
• IGCM_OUT : MONITORING/ATLAS
• temporary REBUILD ; IGCM_OUT : files to be packed, outputs of post-proc jobs
• IGCM_OUT : packed results : Output, Analyse SE and TS
Transfer commands : cp ; dods_cp ; ccc_hsm get
Legend : temporary space ; saved space ; non saved space ; space on tapes ; small precious files ; visible from www ; quotas ; compute ; login
[Figure : TGCC workflow]
Compute (curie) : successive Job_EXP00 jobs ; PeriodLength
• $SCRATCHDIR/IGCM_OUT/XXX/Output
• $SCRATCHDIR/IGCM_OUT/XXX/Restart Debug
Post (curie), RebuildFrequency : rebuild
• $SCRATCHDIR/IGCM_OUT/.../REBUILD
Post (curie), PackFrequency : pack_output (ncrcat) ; pack_restart and pack_debug (tar)
• $CCCSTOREDIR/IGCM_OUT/XXX/Output
• $CCCSTOREDIR/IGCM_OUT/.../RESTART DEBUG
Post (curie), TimeSeriesFrequency : create_ts ; monitoring
Post (curie), SeasonalFrequency : create_se ; atlas
TS and SE : $CCCSTOREDIR/IGCM_OUT/… dods/store
MONITORING and ATLAS : $CCCWORKDIR dods/work
DodsCopy=TRUE/FALSE
IDRIS computers and file system in a nutshell
[Figure, October 2013]
Computers :
• ada : compute
• adapp : front-end ; compute
• turing : front-end ; compute
• gaya
• dods
File system :
• $HOME ; $WORKDIR ; $TMPDIR
• Robotic tapes
Contents :
• sources, small results
• temporary REBUILD ; IGCM_OUT : files to be packed, outputs of post-proc jobs
• IGCM_OUT : Output, Analyse
• MONITORING/ATLAS
Transfer commands : mfput/mfget ; dmput/dmget ; dods_cp
Legend : temporary space ; saved space ; non saved space ; space on tapes ; small precious files ; visible from www ; compute ; login
[Figure : IDRIS workflow]
Compute (ada) : successive Job_EXP00 jobs ; PeriodLength
• $WORKDIR/IGCM_OUT/XXX/Output
• $WORKDIR/IGCM_OUT/XXX/Restart Debug
Post (adapp), RebuildFrequency : rebuild
• $WORKDIR/IGCM_OUT/.../REBUILD
Post (adapp), PackFrequency : pack_output (ncrcat) ; pack_restart and pack_debug (tar)
• gaya:IGCM_OUT/XXX/Output
• gaya:IGCM_OUT/.../RESTART DEBUG
Post (adapp), TimeSeriesFrequency : create_ts ; monitoring
Post (adapp), SeasonalFrequency : create_se ; atlas
gaya:IGCM_OUT/… dods.idris.fr
DodsCopy=TRUE/FALSE
CM5AEH01 – 500 years : 1850-2349
• 500 years
• PeriodLength=1M : 240 computing jobs, PeriodNb=12, 60
• RebuildFrequency=1Y : 432 rebuild
• PackFrequency=10Y : 43 pack_debug, 43 restart, 43 output
• SeasonalFrequency=50Y : 8 create_se and 32 atlas
• TimeSeriesFrequency=10Y : 757 create_ts and 43 monitoring
• 12 manual interventions : 12/1641 = 0.73%
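The total of 1641 jobs and the 0.73% rate can be checked from the job counts listed above. The short Python check below uses only those figures; the dictionary keys are labels chosen for readability, and mapping "restart" and "output" to pack_restart and pack_output is an assumption.

```python
# Job counts for CM5AEH01, as listed on the slide.
job_counts = {
    "compute": 240,
    "rebuild": 432,
    "pack_debug": 43,
    "pack_restart": 43,   # listed as "restart"
    "pack_output": 43,    # listed as "output"
    "create_se": 8,
    "atlas": 32,
    "create_ts": 757,
    "monitoring": 43,
}

total_jobs = sum(job_counts.values())
manual_interventions = 12
rate_percent = 100 * manual_interventions / total_jobs

print(total_jobs)               # 1641
print(f"{rate_percent:.2f}%")   # 0.73%
```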
Incident | Detection | Remedy | Supervision
1- Fatal error in computing job | One email | clean_month and relaunch | 3 attempts
2- Fatal error in post-processing, fatal error in computing job | Two emails | clean_year and relaunch | 3 attempts
3- Computing job missing | Manual : log in and run RunChecker.job | clean_month and relaunch | heartbeat
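The retry count of 3 and the heartbeat idea from the Supervision column can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the callback-based interface and the reading of "3 attempts" as three runs in total are hypothetical, and the `cleanup` callback merely stands in for a clean_month or clean_year step.

```python
from typing import Callable

MAX_ATTEMPTS = 3  # assumed to mean three runs in total


def relaunch_with_cleanup(run: Callable[[], bool],
                          cleanup: Callable[[], None],
                          max_attempts: int = MAX_ATTEMPTS) -> int:
    """Run a job; after each failure, clean up and try again.

    Returns the number of attempts used, or raises RuntimeError
    once max_attempts runs have all failed.
    """
    for attempt in range(1, max_attempts + 1):
        if run():
            return attempt
        cleanup()  # stands in for clean_month / clean_year
    raise RuntimeError(f"job still failing after {max_attempts} attempts")


def job_is_missing(last_heartbeat: float, now: float, timeout: float) -> bool:
    """Heartbeat check: the job counts as missing if silent for more than `timeout` seconds."""
    return now - last_heartbeat > timeout


# Illustrative use: a job that fails twice, then succeeds.
outcomes = iter([False, False, True])
cleanups = []
attempts = relaunch_with_cleanup(lambda: next(outcomes),
                                 lambda: cleanups.append("clean_month"))
print(attempts, cleanups)  # 3 ['clean_month', 'clean_month']
```

Raising after the last attempt keeps the human in the loop for failures that an automatic relaunch does not fix.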
CM5AEH01 : errors encountered
Computing job errors :
• 2123-11 and 2130-07 : Fatal : error writing restartphy, job stuck for a few hours ; clean_month and relaunch
• 2206-04 : Fatal : SLURM error ; clean_month and relaunch
• 2249-03 : Fatal : stuck for 3 h, killed ; clean_month and relaunch
Post-processing job errors :
• 1999, 2000, 2118 and 2127 : pack_restart (1999) and rebuild hit the time limit ; pack_restart and rebuild relaunched and, if needed, pack_output (2119, 2129)
• 2166 and 2174 : rebuild failed : IGCM_sys_rebuild[1860]: /ccc/cont003/home/dsm/p86ipsl/X64_CURIE/bin/rebuild: cannot execute [Permission denied] ; rebuild relaunched
• 2059, 2079, 2119 and 2129 : pack_output launched too early ; pack_output relaunched
Other errors :
• 13 monitoring jobs failed : unstable environment problems (nco) between 10/3 and 30/31
• 1 rebuild submission failed, resource temporarily unav : 3 attempts in libIGCM v2.2
• IDRIS : all jobs disappeared