Watchdog: A job monitoring solution inside the EELA-2 Infrastructure
www.eu-eela.eu
E-science grid facility for Europe and Latin America
Riccardo Bruno, Roberto Barbera, Elisa Ingrà
INFN Sez. Catania (Italy)
2nd EELA-2 Conference
Choroni (Venezuela), 25-27.11.2009
Job Monitoring in gLite
Before gLite v3.1, no job monitoring systems were available
• Jobs running on the WNs are treated as black boxes
• No prompt job status retrieval (Done/Abort/…)
• The Output Sandbox is available only after the WMS recognizes job completion
• This is a problem for jobs that need very long computation times.
[Figure: Jobs → WMS → CEs → WNs, with the Output Sandbox marked "?"]
Analysis
• Need
– Get in touch with the jobs running on the WN (especially long-running jobs), monitoring and controlling their execution.
• How
– Perform job control and monitoring using grid services in the least invasive way for the application.
• Observations
– Almost all Grid jobs are piloted by a main shell script:
    • Get precious info in case of faults
    • Pilot complex batch workflows
– Both AMGA and SE+LFC can be used as a basic Grid Info System
    • lfc-* and lcg-* tools are already available for Grid file management
    • The mdcli AMGA command can be used by jobs on the WNs
    • The cp command can be used in case of a shared file system on the WN
    • The latency of the CLI tools is very low compared to the duration of long-running jobs
Requirements
• Monitor job execution in a timely manner by watching the files produced by the job while it executes on the WN
– File snapshots will be reported to LFC+SE, AMGA servers, or mounted shared FSs
• It would be useful to configure the monitoring tool according to the user's needs
– The monitoring tool will consist only of bash script files
– A few shell environment variables can be used to configure the monitoring behavior
• Control the job execution by accessing the WN directly
– It is possible to send user commands to the WN
– It is possible to change the monitoring while the Grid job runs
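The idea of configuring the monitoring through a few shell environment variables can be illustrated with a minimal sketch. This is not the real watchdog.conf: every variable name (WD_MODE, WD_INTERVAL, WD_FILES, WD_REPORT) and the wd_check_conf helper are assumptions made for illustration only.

```shell
#!/bin/bash
# Hypothetical watchdog.conf-style settings; all names here are assumptions.
# Each value can be overridden from the environment before this file is sourced.
WD_MODE="${WD_MODE:-SHAREDFS}"             # snapshot target: LFC, AMGA or SHAREDFS
WD_INTERVAL="${WD_INTERVAL:-60}"           # seconds between two watching cycles
WD_FILES="${WD_FILES:-file.out file.err}"  # space-separated list of files to watch
WD_REPORT="${WD_REPORT:-partial}"          # full, or partial (only last changes)

# Reject values this sketch does not understand.
wd_check_conf() {
    case "$WD_MODE" in
        LFC|AMGA|SHAREDFS) ;;
        *) echo "bad WD_MODE: $WD_MODE" >&2; return 1 ;;
    esac
    case "$WD_REPORT" in
        full|partial) ;;
        *) echo "bad WD_REPORT: $WD_REPORT" >&2; return 1 ;;
    esac
    case "$WD_INTERVAL" in
        ''|*[!0-9]*) echo "bad WD_INTERVAL: $WD_INTERVAL" >&2; return 1 ;;
    esac
    return 0
}
```

Using `${VAR:-default}` keeps the file usable as-is while still letting a pilot script export a different value before sourcing it.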
The Watchdog
• The Watchdog consists of a set of shell scripts to be included in the JDL InputSandbox and then called by the pilot script.
• Watchdog features:
– It starts in the background before the Grid job runs
– The watchdog runs as long as the main job
– The monitoring process can be piloted until the pilot script has finished
– Easily configurable and customizable
– The watchdog does not compromise the CPU power of the WN
– The watchdog can be used with MPI jobs
– Files may be reported fully or partially (only the last changes)
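The background execution and the partial (only last changes) reporting listed above can be sketched as follows. This is a minimal illustration, not the actual watchdog.sh: the function names wd_report_partial and wd_watch, the hidden offset file, and the timestamp_PID_filename snapshot naming are all assumptions, and the sketch covers only the shared file system case.

```shell
#!/bin/bash
# Sketch only: names, the offset file and the snapshot naming are assumptions.

# Write into directory DEST the bytes appended to FILE since the previous call.
wd_report_partial() {
    local file="$1" dest="$2"
    local state last=0 size
    state="$dest/.$(basename "$file").offset"
    [ -f "$file" ] || return 0
    [ -f "$state" ] && last=$(cat "$state")
    size=$(wc -c < "$file")
    size=$((size))                        # strip padding some wc versions add
    [ "$size" -gt "$last" ] || return 0   # nothing new to report
    # tail -c +N starts at byte N (1-based), so the reported bytes are skipped
    tail -c "+$((last + 1))" "$file" \
        >> "$dest/$(date +%y%m%d%H%M%S)_$$_$(basename "$file")"
    echo "$size" > "$state"
}

# Report FILES every INTERVAL seconds for as long as process PID is alive.
wd_watch() {
    local pid="$1" interval="$2" dest="$3" f
    shift 3
    while kill -0 "$pid" 2>/dev/null; do
        for f in "$@"; do wd_report_partial "$f" "$dest"; done
        sleep "$interval"                 # sleeping keeps the CPU cost negligible
    done
    for f in "$@"; do wd_report_partial "$f" "$dest"; done   # final pass
}
```

Under these assumptions, a pilot script could start the loop with something like `wd_watch $$ 60 /path/to/shared/dir file.out file.err &` (the path is a placeholder), so the loop ends shortly after the pilot script itself exits.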
WD Main Components
• watchdog.sh
– The WD core script; it is responsible for job monitoring, file snapshot reporting, and user command execution
• watchdog.ctrl
– This script controls the execution of the WD core script; it can start, stop, pause, and resume the WD. It can also be used to alter the time interval, add/remove files to watch, and change the reporting strategy (full/partial)
• watchdog.conf
– This script contains all the environment variables needed to configure the WD
– The use of AMGA reporting requires more files
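One way a control script could implement start, stop, pause, and resume is with a PID file and standard signals. The sketch below is an assumption about how such a script might work, not the real watchdog.ctrl: the wd_ctrl function, the WD_DIR variable, and the use of a WDPID file to hold the process ID are hypothetical.

```shell
#!/bin/bash
# Sketch only: not the real watchdog.ctrl. WD_DIR and the WDPID file are assumptions.
WD_DIR="${WD_DIR:-./_watchdog}"

wd_ctrl() {
    local pidfile="$WD_DIR/WDPID" pid
    case "$1" in
        start)
            mkdir -p "$WD_DIR"
            shift
            "$@" &                         # run the given command in the background
            echo $! > "$pidfile"           # remember its PID for later control
            ;;
        stop|pause|resume)
            [ -f "$pidfile" ] || { echo "watchdog not running" >&2; return 1; }
            pid=$(cat "$pidfile")
            case "$1" in
                stop)   kill "$pid" 2>/dev/null; rm -f "$pidfile" ;;
                pause)  kill -STOP "$pid" ;;   # suspend without losing state
                resume) kill -CONT "$pid" ;;   # continue a suspended watchdog
            esac
            ;;
        *)
            echo "usage: wd_ctrl {start CMD...|stop|pause|resume}" >&2
            return 2
            ;;
    esac
}
```

For example, `wd_ctrl start ./watchdog.sh` followed later by `wd_ctrl stop` would mirror the start/stop calls made from a pilot script.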
WD Additional Components
• getinfo.sh / setinfo.sh, getcontent.sh / setcontent.sh (AMGA)
– Utilities to set/get WD reported information to/from the AMGA metadata catalog
• uuencode / uudecode (sharutils) (AMGA)
– Executables needed by the WD to encode binaries and multiline text content into the AMGA metadata catalog in Base64 text format.
– In EELA-2 (prod VO) available in: $VO_PROD_VO_EU_EELA_EU_SW_DIR
• wdcli
– CLI application that lets the user interact with the WD
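The reason for Base64 encoding is that binary or multiline content has to become a single text value before it can be stored as a metadata attribute. The sketch below shows that round trip. It swaps in the coreutils `base64` tool instead of uuencode/uudecode, and the function names wd_encode and wd_decode are hypothetical.

```shell
#!/bin/bash
# Sketch only: uses the `base64` tool as a stand-in for uuencode/uudecode.

# Encode a file (binary or multiline text) into one line of Base64 text.
wd_encode() {
    base64 < "$1" | tr -d '\n'
}

# Decode Base64 text read from stdin back into the original bytes.
wd_decode() {
    base64 --decode
}
```

A value produced by `wd_encode file.out` contains no newlines, so it can be passed around as a single string and restored later with `wd_decode`.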
WD Usage
1. Configure the Watchdog by setting up the watchdog.conf file
2. Applications using the Watchdog MUST include the files watchdog.sh, watchdog.ctrl, and watchdog.conf, plus uuencode and uudecode (in case of AMGA reporting), or add VO_PROD_VO_EU_EELA_EU_SW_DIR to the PATH on the WN
3. Call watchdog.ctrl from the pilot script
App JDL:
Type = "Job";
JobType = "Normal";
Executable = "/bin/bash";
StdOutput = "file.out";
StdError = "file.err";
InputSandbox = {"watchdog.sh", "watchdog.ctrl", "watchdog.conf", "uuencode", "uudecode", "AppPilotScript.sh"};
OutputSandbox = {"MyApp.out", "MyApp.err", "watchdog.log", "watchdog.err"};
Arguments = "AppPilotScript.sh";
AppPilotScript.sh:
#!/bin/sh
…
# Prepare and start the watchdog
PATH=${VO_PROD_VO_EU_EELA_EU_SW_DIR}/:${PATH}:.
chmod +x watchdog.*
./watchdog.ctrl start
# Run the application …
# Use ./watchdog.ctrl
# to control the WD at any time
# Stop the watchdog and wait for it to complete
./watchdog.ctrl stop
WD Interaction
<BASEPATH>/6-tPC2d2knO7m6GP2XC7-Q
_watchdog/
091002232421_wdcli_cmd1.cmd
091002232421_wdcli_cmd1.err
091002232421_wdcli_cmd1.out
...
091002232729_wdcli_cmd7.cmd
091002232729_wdcli_cmd7.err
091002232729_wdcli_cmd7.out
WDEND or WDPID
WDENV
WDHST
cmdlist/
wdcli_cmd8
091002231841_13156_file.err
091002231853_13156_file.out
091002231904_13156_watchdog.err
…
091002232836_13156_watchdog.log
[Figure labels: 6-tPC2d2knO7m6GP2XC7-Q; Flags; WD Control DIR; WD CMD Exe DIR; OUT; ERR; CMD; watchdog.conf; watchdog.sh; WN; File snapshots; LFC/AMGA/Mounted Shared FS]
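The listing suggests that user commands are exchanged through files: a request appears under cmdlist/, and timestamped .cmd, .out, and .err files hold the command and its results. A minimal sketch of such a loop body follows. The exact layout (cmdlist/ inside the control directory), the function name wd_run_pending, and the processing order are assumptions, not the documented behavior of watchdog.sh.

```shell
#!/bin/bash
# Sketch only: the directory layout and the function name are assumptions.

# Run every pending request found in CTRL/cmdlist and store its results in CTRL.
wd_run_pending() {
    local ctrl="$1" req name stamp
    for req in "$ctrl"/cmdlist/*; do
        [ -f "$req" ] || continue          # empty directory: the glob did not match
        name=$(basename "$req")
        stamp=$(date +%y%m%d%H%M%S)
        mv "$req" "$ctrl/${stamp}_${name}.cmd"
        # stdin is /dev/null, so an interactive command cannot block the watchdog
        bash "$ctrl/${stamp}_${name}.cmd" \
            > "$ctrl/${stamp}_${name}.out" \
            2> "$ctrl/${stamp}_${name}.err" < /dev/null
    done
    return 0
}
```

Moving the request out of cmdlist/ before running it means each command is executed once, and the .out and .err files can then be published like any other file snapshot.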
wdcli
• CLI to ease the user's interaction with the WD
– 20091124164201 wd>
• Uses the watchdog.conf file to get the user configuration
• Principal commands:
– set Set MODE (LFC/AMGA/mounted Shared FS)
– show jobs Get list of monitored jobs
– Attach to a monitored job
– show snapshots Get the list of file snapshots
– View the snapshot content
– Get generic info: ENV, PID, CE, WN, Proxy …
– exec Execute a given command
    • Interactive commands are not allowed
    • It is possible to call the watchdog.ctrl command (use the -n option!)
WD in EELA-2
• Presented for the first time at E2GRIS1 in Itacuruçá (Brazil)
– G-HMMER/G-InterProScan (Bioinformatics): get semi-real-time info to be published on the Web
– CrossFire (Civil Protection): get semi-real-time info to view the simulation output
• Presented for the second time at E2GRIS2 in Querétaro (Mexico)
– HeMoLab (Bioinformatics): long-running jobs, check output files while running
– AeroVANT (Engineering): long-running jobs, get data while running
– BioMD (Bioinformatics): long-running job, monitor the simulation
– Seismic Sensors, planned (Earth Science): monitor the job execution
• Cinefilia (Recommender Systems): monitor the computation
Conclusions
• WD mainly used for:
– Job monitoring (long-running jobs)
– Checking/getting data produced by the job
• WD used as:
– A debugging helper tool
– An application component (CrossFire)
• WD is easy to integrate but needs a precise configuration
– EELA-2 has 2 different AMGA servers using different access rights (EU and LA)
– EELA-2 does not have the sharutils (uuencode/uudecode) package installed on the WNs. These tools are available under the WN path VO_PROD_VO_EU_EELA_EU_SW_DIR, or the ‘uu**code’ commands can be put in the InputSandbox
– In EELA-2 several WNs were using a different BDII, so some users were unable to easily retrieve the snapshot content (LFC)
Future
• Improve the user interaction
– Improve wdcli (given its success at E2GRIS2)
– Create tools to easily build web-based front ends
– Provide tools to reconstruct a file monitored incrementally
• Ease the application integration (AMGA)
– Remove the dependency on uuencode/uudecode
– Provide watchdog.conf file templates for VOs
• Improve the monitoring
– Provide independent watching cycles for each file
– Provide a sandboxing mechanism for file I/O from/to the WN
Questions?