Monitoring HTCondor
Andrew Lahiff
STFC Rutherford Appleton Laboratory
European HTCondor Site Admins Meeting 2014
Introduction
• Two aspects of monitoring
– General overview of the system
• How many running/idle jobs? By user/VO? By schedd?
• How full is the farm?
• How many draining worker nodes?
– More detailed views
• What are individual jobs doing?
• What’s happening on individual worker nodes?
• Health of the different components of the HTCondor pool
• ...in addition to Nagios
Introduction
• Methods
– Command line utilities
– Ganglia
– Third-party applications (which run command-line tools or use the Python API)
Command line
• Three useful commands
– condor_status
• Overview of the pool (including jobs, machines)
• Information about specific worker nodes
– condor_q
• Information about jobs in the queue
– condor_history
• Information about completed jobs
Overview of jobs
-bash-4.1$ condor_status -collector
Name Machine RunningJobs IdleJobs HostsTotal
[email protected]. condor01.gridpp.rl 10608 8355 11347
[email protected]. condor02.gridpp.rl 10616 8364 11360
Overview of machines
-bash-4.1$ condor_status -total
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 11183 95 10441 592 0 0 0
Total 11183 95 10441 592 0 0 0
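The "How full is the farm?" question from the introduction can be answered from this totals row. The following is a minimal sketch, assuming the text layout shown above (a header line followed by rows whose first field is a label); the function names `parse_totals` and `claimed_fraction` are hypothetical and the sample text is copied from the slide.

```python
# Sketch only: parses captured `condor_status -total` text, assuming the
# column layout shown on the slide (it may differ between HTCondor versions).
SAMPLE = """\
             Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 11183    95   10441       592       0          0        0
       Total 11183    95   10441       592       0          0        0
"""


def parse_totals(text):
    """Return the final 'Total' row as a dict mapping column name to count."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header = rows[0]  # column names; the row-label column has no header
    for fields in reversed(rows[1:]):
        if fields[0] == "Total":
            return dict(zip(header, (int(v) for v in fields[1:])))
    raise ValueError("no Total row found")


def claimed_fraction(totals):
    """Fraction of slots in the Claimed state."""
    return totals["Claimed"] / totals["Total"]


if __name__ == "__main__":
    totals = parse_totals(SAMPLE)
    print("claimed: {:.1%}".format(claimed_fraction(totals)))
```

With the numbers on the slide, roughly 93% of slots are claimed.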
Jobs by schedd
-bash-4.1$ condor_status -schedd
Name Machine TotalRunningJobs TotalIdleJobs TotalHeldJobs
arc-ce01.gridpp.rl.a arc-ce01.g 2388 1990 13
arc-ce02.gridpp.rl.a arc-ce02.g 2011 1995 31
arc-ce03.gridpp.rl.a arc-ce03.g 4272 1994 9
arc-ce04.gridpp.rl.a arc-ce04.g 1424 2385 12
arc-ce05.gridpp.rl.a arc-ce05.g 1 0 6
cream-ce01.gridpp.rl cream-ce01 266 0 0
cream-ce02.gridpp.rl cream-ce02 247 0 0
lcg0955.gridpp.rl.ac lcg0955.gr 0 0 0
lcgui03.gridpp.rl.ac lcgui03.gr 3 0 0
lcgui04.gridpp.rl.ac lcgui04.gr 0 0 0
lcgvm21.gridpp.rl.ac lcgvm21.gr 0 0 0
TotalRunningJobs TotalIdleJobs TotalHeldJobs
Total 10612 8364 71
Jobs by user, schedd
-bash-4.1$ condor_status -submitters
Name Machine RunningJobs IdleJobs HeldJobs
group_ALICE.alice.alice043@g arc-ce01.gridpp.rl 0 0 0
group_ALICE.alice.alicesgm@g arc-ce01.gridpp.rl 540 0 1
group_ATLAS.atlas_pilot.tatl arc-ce01.gridpp.rl 142 0 0
group_ATLAS.prodatls.patls00 arc-ce01.gridpp.rl 82 5 0
group_CMS.cms.cmssgm@gridpp. arc-ce01.gridpp.rl 1 0 0
group_CMS.cms_pilot.ttcms022 arc-ce01.gridpp.rl 214 390 0
group_CMS.cms_pilot.ttcms043 arc-ce01.gridpp.rl 68 100 0
group_CMS.prodcms.pcms004@gr arc-ce01.gridpp.rl 78 476 4
group_CMS.prodcms.pcms054@gr arc-ce01.gridpp.rl 12 910 0
group_CMS.prodcms_multicore. arc-ce01.gridpp.rl 47 102 0
group_DTEAM_OPS.ops.ops047@g arc-ce01.gridpp.rl 0 0 0
group_LHCB.lhcb_pilot.tlhcb0 arc-ce01.gridpp.rl 992 0 2
group_NONLHC.snoplus.snoplus arc-ce01.gridpp.rl 0 0 0
…
…
Jobs by user
RunningJobs IdleJobs HeldJobs
group_ALICE.alice.al 0 0 0
group_ALICE.alice.al 3500 368 5
group_ALICE.alice_pi 0 0 0
group_ATLAS.atlas.at 0 0 0
group_ATLAS.atlas.at 0 0 0
group_ATLAS.atlas_pi 414 12 10
group_ATLAS.atlas_pi 0 0 2
group_ATLAS.prodatls 354 36 11
group_CMS.cms.cmssgm 1 0 0
group_CMS.cms_pilot. 371 2223 0
group_CMS.cms_pilot. 0 0 1
group_CMS.cms_pilot. 68 200 0
group_CMS.prodcms.pc 188 1905 10
group_CMS.prodcms.pc 312 3410 0
group_CMS.prodcms_mu 47 102 0
…
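The per-user totals above can be rolled up further by VO. This is a rough sketch, assuming the submitter names start with an accounting-group prefix such as `group_ALICE` (as on the slide) and that the input lines hold four whitespace-separated fields: name, running, idle, held. One way to obtain such lines would be the `-af` (autoformat) option that appears later in these slides, but that is an assumption about how the data is collected. The function name `jobs_by_group` is hypothetical, and the sample names are the truncated ones shown on the slide.

```python
from collections import defaultdict

# Sample lines in "Name RunningJobs IdleJobs HeldJobs" form, taken from the
# slide (the names are truncated there and are kept truncated here).
SAMPLE_LINES = [
    "group_ALICE.alice.alicesgm@g 540 0 1",
    "group_ATLAS.atlas_pilot.tatl 142 0 0",
    "group_ATLAS.prodatls.patls00 82 5 0",
    "group_CMS.cms_pilot.ttcms022 214 390 0",
]


def jobs_by_group(lines):
    """Sum (running, idle, held) per accounting-group prefix."""
    totals = defaultdict(lambda: [0, 0, 0])
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        name = fields[0]
        counts = [int(v) for v in fields[1:]]
        group = name.split(".", 1)[0]  # e.g. "group_ALICE"
        for i, value in enumerate(counts):
            totals[group][i] += value
    return {group: tuple(c) for group, c in totals.items()}


if __name__ == "__main__":
    for group, (running, idle, held) in sorted(jobs_by_group(SAMPLE_LINES).items()):
        print(group, running, idle, held)
```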
condor_q
[root@arc-ce01 ~]# condor_q
-- Submitter: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:64454> : arc-ce01.gridpp.rl.ac.uk
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
794717.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )
794718.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )
794719.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )
794720.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )
794721.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )
794722.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )
794723.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )
794725.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )
794726.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )
…
3502 jobs; 0 completed, 0 removed, 1528 idle, 1965 running, 9 held, 0 suspended
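The summary line above can also be produced from the Python API mentioned in the introduction. The sketch below is illustrative only: `summarise` works on any sequence of mappings with a `JobStatus` key, while `query_local_schedd` assumes the HTCondor Python bindings are installed and that `Schedd.query` accepts a constraint and an attribute list positionally (keyword names have varied between versions). Both function names are hypothetical.

```python
from collections import Counter

# Standard JobStatus codes; the slides themselves use 3 for removed jobs.
JOB_STATUS = {
    1: "idle",
    2: "running",
    3: "removed",
    4: "completed",
    5: "held",
    6: "transferring output",
    7: "suspended",
}


def summarise(ads):
    """Count jobs per state from an iterable of job ads."""
    counts = Counter(
        JOB_STATUS.get(int(ad["JobStatus"]), "unknown") for ad in ads
    )
    return dict(counts)


def query_local_schedd():
    """Fetch only JobStatus for every job on the local schedd (needs bindings)."""
    import htcondor  # imported lazily so summarise() works without it

    return htcondor.Schedd().query("true", ["JobStatus"])


if __name__ == "__main__":
    # Stand-in data; replace with query_local_schedd() on a real schedd host.
    fake_ads = [{"JobStatus": 1}, {"JobStatus": 2}, {"JobStatus": 2}]
    print(summarise(fake_ads))
```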
Multi-core jobs
-bash-4.1$ condor_q -global -constraint 'RequestCpus > 1'
-- Schedd: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:39356>
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
832677.0 pcms004 12/5 14:33 0+00:15:07 R 0 2.0 (gridjob )
832717.0 pcms004 12/5 14:37 0+00:12:02 R 0 0.0 (gridjob )
832718.0 pcms004 12/5 14:37 0+00:00:00 I 0 0.0 (gridjob )
832719.0 pcms004 12/5 14:37 0+00:00:00 I 0 0.0 (gridjob )
832893.0 pcms004 12/5 14:47 0+00:00:00 I 0 0.0 (gridjob )
832894.0 pcms004 12/5 14:47 0+00:00:00 I 0 0.0 (gridjob )
…
Multi-core jobs
• Custom print format
-bash-4.1$ condor_q -global -pr queue_mc.cpf
-- Schedd: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:39356>
ID OWNER SUBMITTED RUN_TIME ST SIZE CMD CORES
832677.0 pcms004 12/5 14:33 0+00:00:00 R 2.0 (gridjob) 8
832717.0 pcms004 12/5 14:37 0+00:00:00 R 0.0 (gridjob) 8
832718.0 pcms004 12/5 14:37 0+00:00:00 I 0.0 (gridjob) 8
832719.0 pcms004 12/5 14:37 0+00:00:00 I 0.0 (gridjob) 8
832893.0 pcms004 12/5 14:47 0+00:00:00 I 0.0 (gridjob) 8
832894.0 pcms004 12/5 14:47 0+00:00:00 I 0.0 (gridjob) 8
…
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=ExperimentalCustomPrintFormats
Jobs with specific DN
-bash-4.1$ condor_q -global -constraint 'x509userproxysubject=="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1"'
-- Schedd: arc-ce03.gridpp.rl.ac.uk : <130.246.181.25:62763>
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
678275.0 tatls015 12/2 17:57 2+06:07:15 R 0 2441.4 (arc_pilot )
681762.0 tatls015 12/3 03:13 1+21:12:31 R 0 2197.3 (arc_pilot )
705153.0 tatls015 12/4 07:36 0+16:49:12 R 0 2197.3 (arc_pilot )
705807.0 tatls015 12/4 08:16 0+16:09:27 R 0 2197.3 (arc_pilot )
705808.0 tatls015 12/4 08:16 0+16:09:27 R 0 2197.3 (arc_pilot )
706612.0 tatls015 12/4 09:16 0+15:09:37 R 0 2197.3 (arc_pilot )
706614.0 tatls015 12/4 09:16 0+15:09:26 R 0 2197.3 (arc_pilot )
…
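Constraints that embed a DN are easy to get wrong on the command line because of the nested quoting. One way around this, sketched here, is to build the argument list in Python and pass it to `subprocess` without a shell. The helper names `classad_string` and `dn_query_command` are hypothetical, and only double quotes are escaped; DNs containing backslashes are not handled and would need checking against the ClassAd string rules.

```python
def classad_string(value):
    """Wrap a Python string as a ClassAd string literal (escapes " only)."""
    return '"' + value.replace('"', '\\"') + '"'


def dn_query_command(dn):
    """Build the condor_q argument list used on the slide for a given DN."""
    constraint = "x509userproxysubject == " + classad_string(dn)
    return ["condor_q", "-global", "-constraint", constraint]


if __name__ == "__main__":
    dn = ("/DC=ch/DC=cern/OU=Organic Units/OU=Users"
          "/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1")
    cmd = dn_query_command(dn)
    print(cmd)
    # On a host with HTCondor installed, the list can be run without a shell,
    # so no extra quoting is needed:
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```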
Jobs killed
• Jobs which were removed
[root@arc-ce01 ~]# condor_history -constraint 'JobStatus == 3'
ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD
823881.0 alicesgm 12/5 01:01 1+06:13:22 X ??? /var/spool/arc/grid03/CVuMDmBSwGlnCIXDjqi
831849.0 tlhcb005 12/5 13:19 0+18:52:26 X ??? /var/spool/arc/grid09/gWmLDm5x7GlnCIXDjqi
832753.0 tlhcb005 12/5 14:38 0+17:07:07 X ??? /var/spool/arc/grid00/5wqKDm7C9GlnCIXDjqi
819636.0 alicesgm 12/4 19:27 1+12:13:56 X ??? /var/spool/arc/grid00/mlrNDmoErGlnCIXDjqi
825511.0 alicesgm 12/5 03:03 0+18:52:10 X ??? /var/spool/arc/grid04/XpuKDmxLyGlnCIXDjqi
823799.0 alicesgm 12/5 00:56 1+05:58:15 X ??? /var/spool/arc/grid03/DYuMDmzMwGlnCIXDjqi
820001.0 alicesgm 12/4 19:48 1+06:43:22 X ??? /var/spool/arc/grid08/cmzNDmpYrGlnCIXDjqi
833589.0 alicesgm 12/5 16:01 0+14:06:34 X ??? /var/spool/arc/grid09/HKSLDmqUAHlnCIXDjqi
778644.0 tlhcb005 12/2 05:56 4+00:00:10 X ??? /var/spool/arc/grid00/pIJNDm6cvFlnCIXDjqi
…
Jobs killed
• Jobs removed for exceeding memory limit
[root@arc-ce01 ~]# condor_history -constraint 'JobStatus==3 && ResidentSetSize>1024*RequestMemory' -af ClusterId Owner ResidentSetSize RequestMemory
823953 alicesgm 3500000 3000
824438 alicesgm 3250000 3000
820045 alicesgm 3500000 3000
823881 alicesgm 3250000 3000
…
[root@arc-ce04 ~]# condor_history -constraint 'JobStatus==3 && ResidentSetSize>1024*RequestMemory' -af x509UserProxyVOName | sort | uniq -c
515 alice
5 cms
70 lhcb
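The factor of 1024 in the constraint comes from the units: ResidentSetSize is reported in KiB while RequestMemory is in MiB. For the first job listed, 3500000 KiB is compared with 1024 × 3000 = 3072000 KiB, so the job is over its request. The sketch below repeats the comparison and the `sort | uniq -c` tally in Python; the function names are hypothetical and the `cms` record is made up for illustration.

```python
from collections import Counter


def exceeded_request(rss_kib, request_mib):
    """True if the resident set size (KiB) is above the request (MiB)."""
    return rss_kib > 1024 * request_mib


def tally_by_vo(records):
    """Count over-limit jobs per VO, like `sort | uniq -c` on the VO name.

    `records` is an iterable of (vo, rss_kib, request_mib) tuples.
    """
    return Counter(
        vo for vo, rss_kib, request_mib in records
        if exceeded_request(rss_kib, request_mib)
    )


if __name__ == "__main__":
    # The alice values are from the slide; the cms record is illustrative.
    records = [
        ("alice", 3500000, 3000),
        ("alice", 3250000, 3000),
        ("cms", 2000000, 3000),
    ]
    for vo, count in tally_by_vo(records).items():
        print(count, vo)
```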
condor_who
• What jobs are currently running on a worker node?
[root@lcg1211 ~]# condor_who
OWNER CLIENT SLOT JOB RUNTIME PID PROGRAM
[email protected] arc-ce02.gridpp.rl.ac.uk 1_2 654753.0 0+00:01:54 15743 /usr/libexec/condor/co
[email protected] arc-ce02.gridpp.rl.ac.uk 1_5 654076.0 0+00:56:50 21916 /usr/libexec/condor/co
[email protected] arc-ce04.gridpp.rl.ac.uk 1_10 1337818.0 0+02:51:34 31893 /usr/libexec/condor/co
[email protected] arc-ce04.gridpp.rl.ac.uk 1_7 1337776.0 0+03:06:51 32295 /usr/libexec/condor/co
[email protected] arc-ce02.gridpp.rl.ac.uk 1_1 651508.0 0+05:02:45 17556 /usr/libexec/condor/co
[email protected] arc-ce03.gridpp.rl.ac.uk 1_4 737874.0 0+05:44:24 5032 /usr/libexec/condor/co
[email protected] arc-ce04.gridpp.rl.ac.uk 1_6 1336938.0 0+08:42:18 26911 /usr/libexec/condor/co
[email protected] arc-ce01.gridpp.rl.ac.uk 1_8 826808.0 1+02:50:16 3485 /usr/libexec/condor/co
[email protected] arc-ce03.gridpp.rl.ac.uk 1_3 722597.0 1+08:44:28 22966 /usr/libexec/condor/co
Startd history
• If STARTD_HISTORY is defined on your WNs
[root@lcg1658 ~]# condor_history
ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD
841989.0 tatls015 12/6 07:58 0+00:02:39 C 12/6 08:01 /var/spool/arc/grid03/PZ6NDmPQPHlnCIXDjqi
841950.0 tatls015 12/6 07:56 0+00:02:40 C 12/6 07:59 /var/spool/arc/grid03/mckKDm4OPHlnCIXDjqi
841889.0 tatls015 12/6 07:53 0+00:02:33 C 12/6 07:56 /var/spool/arc/grid01/X3bNDmTMPHlnCIXDjqi
841847.0 tatls015 12/6 07:50 0+00:02:35 C 12/6 07:54 /var/spool/arc/grid00/yHHODmfJPHlnCIXDjqi
841816.0 tatls015 12/6 07:48 0+00:02:36 C 12/6 07:51 /var/spool/arc/grid04/iizMDmVHPHlnCIXDjqi
841791.0 tatls015 12/6 07:45 0+00:02:33 C 12/6 07:48 /var/spool/arc/grid00/N3vKDmKEPHlnCIXDjqi
716804.0 alicesgm 12/4 18:28 1+13:15:07 C 12/6 07:44 /var/spool/arc/grid07/TUQNDmUJqGlnzEJDjqI
…
Ganglia
• condor_gangliad
– Runs on a single host (can be any host)
– Gathers daemon ClassAds from the collector
– Publishes metrics to ganglia with host spoofing
• At RAL we have on one host
GANGLIAD_VERBOSITY = 2
GANGLIAD_PER_EXECUTE_NODE_METRICS = False
GANGLIAD = $(LIBEXEC)/condor_gangliad
GANGLIA_CONFIG = /etc/gmond.conf
GANGLIAD_METRICS_CONFIG_DIR = /etc/condor/ganglia.d
GANGLIA_SEND_DATA_FOR_ALL_HOSTS = true
DAEMON_LIST = MASTER, GANGLIAD
Ganglia
• Small subset from schedd [plots]
• Small subset from central manager [plots]
Easy to make custom plots
Total running, idle, held jobs [plot]
Running jobs by schedd [plot]
Negotiator health [plots: negotiation cycle duration; number of AutoClusters]
Draining & multi-core slots [plot]
(Some) Third party tools
Job overview
• Condor Job Overview Monitor
http://sarkar.web.cern.ch/sarkar/doc/condor_jobview.html
Mimic
• Internal RAL application
htcondor-sysview
• Hover mouse over a core to get job information
Nagios
• Most (all?) sites probably use Nagios or an alternative
• At RAL
– Process checks for condor_master on all nodes
– Central managers
• Check for at least 1 collector
• Check for the negotiator
• Check for worker nodes
  – Number of startd ClassAds needs to be above a threshold
  – Number of non-broken worker nodes needs to be above a threshold
– CEs
• Check for schedd
• Job submission test
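A threshold check such as "number of startd ClassAds above a threshold" maps naturally onto the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The following is a minimal sketch of that logic only, not the check used at RAL: the function name, the thresholds and the way the count is obtained are all assumptions.

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_startd_count(count, warn_below, crit_below):
    """Map a startd ClassAd count onto a Nagios status and message.

    `count` is None when the collector could not be queried.
    """
    if count is None:
        return UNKNOWN, "UNKNOWN - could not query the collector"
    if count < crit_below:
        return CRITICAL, "CRITICAL - only {} startd ClassAds".format(count)
    if count < warn_below:
        return WARNING, "WARNING - only {} startd ClassAds".format(count)
    return OK, "OK - {} startd ClassAds".format(count)


if __name__ == "__main__":
    # Hypothetical usage: the count is passed as the first argument, for
    # example from a wrapper that counts the ads returned by the collector.
    try:
        count = int(sys.argv[1])
    except (IndexError, ValueError):
        count = None
    status, message = check_startd_count(count, warn_below=400, crit_below=300)
    print(message)
    sys.exit(status)
```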