HTCondor at the RAL Tier-1

Andrew Lahiff

Overview
• Current status at RAL
• Multi-core jobs
• Interesting features we’re using
• Some interesting features & commands
• Future plans

Current status at RAL

Background
• Computing resources
  – 784 worker nodes, over 14K cores
  – Generally have 40-60K jobs submitted per day
• Torque / Maui had been used for many years
  – Many issues
  – Severity & number of problems increased as size of farm increased
  – Doesn’t like dynamic resources
    • A problem if we want to extend batch system into the cloud
  – In 2012 decided it was time to start investigating moving to a new batch system

Choosing a new batch system
• Considered, tested & eventually rejected the following
  – LSF, Univa Grid Engine*
    • Requirement: avoid commercial products unless absolutely necessary
  – Open source Grid Engines
    • Competing products, not sure which has the best long-term future
    • Communities appear less active than HTCondor & SLURM
    • Existing Tier-1s running Grid Engine use the commercial version
  – Torque 4 / Maui
    • Maui problematic
    • Torque 4 seems less scalable than alternatives (but better than Torque 2)
  – SLURM
    • Carried out extensive testing & comparison with HTCondor
    • Found that for our use case:
      – Very fragile, easy to break
      – Unable to get reliably working above 6000 running jobs

* Only tested open source Grid Engine, not Univa Grid Engine

Choosing a new batch system
• HTCondor chosen as replacement for Torque/Maui
  – Has the features we require
  – Seems very stable
  – Easily able to run 16,000 simultaneous jobs
    • Didn’t do any tuning – it “just worked”
    • Have since tested > 30,000 running jobs
  – Is more customizable than all other batch systems

The story so far
• History of HTCondor at RAL
  – Jun 2013: started testing with real ATLAS & CMS jobs
  – Sep 2013: 50% pledged resources moved to HTCondor
  – Nov 2013: fully migrated to HTCondor
• Experience
  – Very stable operation
  – No changes needed as the HTCondor pool increased in size from ~1000 to ~14000 cores
  – Job start rate much higher than Torque / Maui even when throttled
  – Very good support


[Diagram labels: central manager; worker nodes (condor_startd)]


Our setup
• 2 central managers
  – condor_master
  – condor_collector
  – condor_had (responsible for high-availability)
  – condor_replication (responsible for high-availability)
  – condor_negotiator (running on at most 1 machine at a time)
• 8 submission hosts (4 ARC CE, 2 CREAM CE, 2 UI)
  – condor_master
  – condor_schedd
• Lots of worker nodes
  – condor_master
  – condor_startd
• Monitoring box (runs 8.1.x which contains ganglia integration)
  – condor_master
  – condor_gangliad

Computing elements
• ARC experience so far
  – Have run over 9.4 million jobs so far across our ARC CEs
  – Generally ignore them & they “just work”
• VO status
  – ATLAS & CMS
    • Fine from the beginning
  – LHCb
    • Added ability to DIRAC to submit to ARC
  – ALICE
    • Not yet able to submit to ARC, have said they will work on this
  – Non-LHC VOs
    • Some use DIRAC, which can now submit to ARC
    • Some use EMI WMS, which can submit to ARC

Computing elements
• ARC 4.1.0 released recently
  – Will be in UMD very soon
    • Has just passed through staged-rollout
  – Contains all of our fixes to HTCondor backend scripts
• Plans
  – When VOs start using RFC proxies we could enable the web service interface
    • Doesn’t affect ATLAS/CMS
    • VOs using NorduGrid client commands (e.g. LHCb) can get job status information more quickly

Computing elements
• Alternative: HTCondor-CE
  – Special configuration of HTCondor, not a brand new service
  – Some sites starting to use this in the US
  – Note: contains no BDII (!)

[Diagram labels: ATLAS & CMS pilot factories; job router; central manager(s); worker nodes]

Multi-core jobs

Getting multi-core jobs to work
• Job submission
  – Haven’t set up dedicated queues
  – VO has to request how many cores they want in their JDL
    • Fine for ATLAS & CMS, not sure yet for LHCb/DIRAC…
    • Could set up additional queues if necessary
• Did 5 things on the HTCondor side to support multi-core jobs…

Getting multi-core jobs to work
• Worker nodes configured to use partitionable slots
  – WN resources divided up as necessary amongst jobs
  – We had this configuration anyway
• Set up multi-core accounting groups & associated quotas
  – Configured so that multi-core jobs automatically get assigned to the appropriate groups
  – Specified group quotas (fairshares) for the multi-core groups
• Adjusted the order in which the negotiator considers groups
  – Consider multi-core groups before single core groups
    • 8 free slots are “expensive” to obtain, so try not to lose them too quickly

Getting multi-core jobs to work
• Set up condor_defrag daemon
  – Finds WNs to drain, triggers draining & cancels draining as required
  – Picks WNs to drain based on how many cores they have that can be freed up
    • E.g. getting 8 free cores by draining a full 32 core WN is generally faster than draining a full 8 core WN
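A small worked example of why the larger node tends to free 8 cores sooner. This is only a sketch: it assumes every core runs one single-core job, that remaining runtimes are known (in practice they are not), and that no new jobs start while the node drains. The function name `time_to_free` and the sample runtimes are hypothetical.

```python
def time_to_free(remaining_hours, cores_needed):
    """Hours until `cores_needed` cores are free on a draining node.

    Assumes one single-core job per core and that no new jobs start
    while the node drains.
    """
    if cores_needed > len(remaining_hours):
        raise ValueError("node has fewer cores than requested")
    # The k-th shortest remaining runtime is when the k-th core frees up.
    return sorted(remaining_hours)[cores_needed - 1]


# Hypothetical remaining runtimes (hours) of the jobs on two full nodes.
small_node = [1, 3, 5, 8, 12, 20, 30, 40]   # 8 cores
large_node = small_node * 4                  # 32 cores, same job mix

print(time_to_free(small_node, 8))  # must wait for the longest job
print(time_to_free(large_node, 8))  # only the 8 shortest jobs matter
```

On the 8 core node all eight jobs must finish, so the wait is set by the longest one; on the 32 core node only the eight shortest jobs need to finish.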

Getting multi-core jobs to work
• Improvement to condor_defrag daemon
  – Demand for multi-core jobs not known by condor_defrag
  – Set up simple cron which adjusts number of concurrent draining WNs based on demand
    • If many idle multi-core jobs but few running, drain aggressively
    • Otherwise very little draining
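The decision made by such a cron could be sketched roughly as below. This is not the RAL script: the function name, the thresholds for "many idle" and "few running", and the two drain levels are all hypothetical values chosen for illustration. The slides also do not say which setting the cron adjusts; a knob such as DEFRAG_MAX_CONCURRENT_DRAINING would be one plausible target for the returned value.

```python
def concurrent_drains(idle_multicore, running_multicore,
                      many_idle=20, few_running=50,
                      aggressive=10, relaxed=1):
    """Choose how many worker nodes may drain at the same time.

    All thresholds and drain levels are illustrative assumptions.
    """
    # Many multi-core jobs queued but few running: free up cores quickly.
    if idle_multicore >= many_idle and running_multicore <= few_running:
        return aggressive
    # Otherwise keep draining (and hence wasted CPU) to a minimum.
    return relaxed


print(concurrent_drains(idle_multicore=100, running_multicore=5))
print(concurrent_drains(idle_multicore=100, running_multicore=500))
```

Keeping the policy in a small pure function makes it easy to tune the thresholds without touching the code that queries the pool or rewrites the configuration.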

Results
[Plot: running & idle multi-core jobs]
Gaps in submission by ATLAS result in loss of multi-core slots.

[Plot: number of WNs running multi-core jobs & draining WNs]
Significantly reduced CPU wastage due to the cron
• Aggressive draining: 3% waste
• Less-aggressive draining: < 1% waste

Multi-core jobs
• Current status
  – Haven’t made any changes over the past few months
  – Now only CMS running multi-core jobs
    • Waiting for ATLAS to start up again
    • Will be interesting to see what happens when multiple VOs run multi-core jobs
  – May look at making improvements if necessary
• Details about our configuration here:

Interesting features we’re using (or are about to use)

Startd cron
• Worker node health check script
  – Script run on WN at regular intervals by HTCondor
  – Can place custom information into WN ClassAds. In our case:
    • NODE_IS_HEALTHY (should WN start jobs or not)
    • NODE_STATUS (list of any problems)

    ## When is this node willing to run jobs?
    START = (NODE_IS_HEALTHY =?= True)
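A minimal sketch of what such a health check script might look like, assuming the usual startd cron convention that the script prints `Attribute = value` lines on stdout for HTCondor to merge into the machine ClassAd. The attribute names and the "All_OK" value come from these slides; the function names, the single writability check and the wording of the problem string are hypothetical, and the real RAL script is not shown here.

```python
import os


def check_writable(path):
    """Return a problem string if `path` is not writable, else None."""
    if not os.access(path, os.W_OK):
        return f"Problem: {path} is not writable"
    return None


def classad_lines(problems):
    """Format the check results as ClassAd attribute assignments."""
    healthy = not problems
    # Problem strings must not contain double quotes (no escaping here).
    status = "All_OK" if healthy else "; ".join(problems)
    return [
        f"NODE_IS_HEALTHY = {'True' if healthy else 'False'}",
        f'NODE_STATUS = "{status}"',
    ]


if __name__ == "__main__":
    # Run each (hypothetical) check and keep only the reported problems.
    found = [p for p in (check_writable("/tmp"),) if p]
    print("\n".join(classad_lines(found)))
```

Because the START expression uses `=?=`, a node whose script has not yet reported (NODE_IS_HEALTHY undefined) also refuses jobs, which is the safe default.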

Startd cron
• Current checks
  – CVMFS
  – Filesystem problems (e.g. read-only)
  – Swap usage
• Plans
  – May add more checks, e.g. CPU load
  – More thorough tests only run when HTCondor first starts up?
    • E.g. checks for grid middleware, checks for other essential software & configuration, …
    • Want to be sure that jobs will never be started unless the WN is correctly set up
    • Will be more important for dynamic virtual WNs

Startd cron
• Easily list any WNs with problems, e.g.

    # condor_status -constraint 'partitionableslot == True && NODE_STATUS != "All_OK"' -autoformat Machine
    Problem: CVMFS for
    Problem: CVMFS for
    Problem: CVMFS for
    Problem: CVMFS for

• gmetric script using HTCondor Python API for making Ganglia plots

PID namespaces
• We have USE_PID_NAMESPACES=True on WNs
  – Jobs can’t see any system processes or processes associated with other jobs on the WN
  – Example stdout of job running “ps -e”:

      PID TTY          TIME CMD
        1 ?        00:00:00 condor_exec.exe
        3 ?        00:00:00 ps

MOUNT_UNDER_SCRATCH
• Each job sees a different /tmp, /var/tmp
  – Uses bind mounts to directories inside job scratch area
    • No more junk left behind in /tmp
    • Jobs can’t fill /tmp & cause problems
    • Jobs can’t see what other jobs have written into /tmp
• For glexec jobs to work, need a special lcmaps plugin enabled
  – lcmaps-plugins-mount-under-scratch
  – Minor tweak to lcmaps.db
• We have tested this but it’s not yet rolled out

CPU affinity
• Can set on WN ASSIGN_CPU_AFFINITY=True
• Jobs locked to specific cores
• Problem:
  – When PID namespaces also used, CPU affinity doesn’t work

Control groups
• Cgroups: mechanism for managing a set of processes
• We’re starting with the most basic option: CGROUP_MEMORY_LIMIT_POLICY=none
  – No memory limits applied
  – Cgroups used for
    • Process tracking
    • Memory accounting
    • CPU usage assigned proportionally to the number of CPUs in the slot
  – Currently configured on 2 WNs for initial testing with production jobs
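A back-of-the-envelope illustration of proportional CPU assignment. It assumes CPU time is divided in proportion to each slot's core count only when the node is fully busy, so a slot can still use idle cycles otherwise. The function name and the example job mix are hypothetical.

```python
def cpu_fraction(slot_cpus, all_slot_cpus):
    """Fraction of a fully contended node's CPU time given to one slot.

    `all_slot_cpus` lists the core counts of every busy slot on the
    node, including the slot of interest.
    """
    return slot_cpus / sum(all_slot_cpus)


# Hypothetical 16 core node: one 8-core job plus eight single-core jobs.
slots = [8] + [1] * 8
print(cpu_fraction(8, slots))  # share of the 8-core job
print(cpu_fraction(1, slots))  # share of each single-core job
```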

Some interesting features & commands

Upgrades
• Central managers, CEs:
  – Update the RPMs
  – condor_master will notice the binaries have changed & restart daemons as required
• Worker nodes
  – Update the RPMs
  – condor_master will notice the binaries have changed, drain running jobs, then restart daemons as required

Dealing with held jobs
• Held jobs have failed in some way & remain in the queue waiting for user intervention
  – E.g. input file(s) missing from CE
• Can configure HTCondor to deal with them automatically
  – Try re-running held jobs once only after waiting 30 minutes

    SYSTEM_PERIODIC_RELEASE = ((CurrentTime - EnteredCurrentStatus > 30 * 60) && (JobRunCount < 2))

  – Remove held jobs after 24 hours

    SYSTEM_PERIODIC_REMOVE = ((CurrentTime - EnteredCurrentStatus > 24 * 60 * 60) && (JobStatus == 5))
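The combined effect of the two expressions can be sketched in plain Python, treating a job ClassAd as a dictionary. This is only an illustration of the policy, not how HTCondor evaluates it: the function name is hypothetical, the release rule is assumed to apply only to held jobs, and the order in which the two rules are checked is an assumption of this sketch.

```python
HELD = 5  # JobStatus value tested in SYSTEM_PERIODIC_REMOVE


def periodic_action(job, now):
    """Return "remove", "release" or "none" for one job.

    `job` needs JobStatus, EnteredCurrentStatus (epoch seconds)
    and JobRunCount.
    """
    if job["JobStatus"] != HELD:
        return "none"
    held_for = now - job["EnteredCurrentStatus"]
    if held_for > 24 * 60 * 60:
        return "remove"   # held for more than 24 hours
    if held_for > 30 * 60 and job["JobRunCount"] < 2:
        return "release"  # one retry after a 30 minute wait
    return "none"


now = 1_400_000_000
job = {"JobStatus": HELD, "EnteredCurrentStatus": now - 31 * 60, "JobRunCount": 1}
print(periodic_action(job, now))
```

A job that is held again after its second run has JobRunCount of 2, so it is no longer released and simply waits until the 24 hour removal rule applies.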

condor_who
• See what jobs are running on a worker node

[root@lcg0975 ~]# condor_who

OWNER CLIENT SLOT JOB        RUNTIME    PID   PROGRAM
             1_5  2730014.0  0+03:30:12 30184 /pool/condor/dir_30180/
             1_7  3189534.0  0+03:38:19 21266 /pool/condor/dir_21262/
             1_6  2729613.0  0+04:40:21  6942 /pool/condor/dir_6938/condor_exec.exe
             1_4  2977866.0  0+08:42:13 26669 /pool/condor/dir_26665/
             1_1  3186150.0  0+10:57:05 12401 /pool/condor/dir_12342/
             1_3  3174829.0  0+16:03:57 26418 /pool/condor/dir_26331/condor_exec.exe
             1_2  3149655.0  1+23:55:51 31281 /pool/condor/dir_31268/condor_exec.exe

condor_q -analyze (1)
• Why isn’t a job running?

-bash-4.1$ condor_q -analyze 16244

-- Submitter: : <> :
priority for is not available, attempting to analyze without it.
---
16244.000:  Run analysis summary.  Of 13180 machines,
  13180 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job

WARNING:  Be advised: No resources matched request's constraints

condor_q -analyze (2)

The Requirements expression for your job is:

    ( ( Ceph is true ) ) && ( TARGET.Arch == "X86_64" ) &&
    ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) &&
    ( TARGET.Memory >= RequestMemory ) && ( TARGET.Cpus >= RequestCpus ) &&
    ( TARGET.HasFileTransfer )

    Condition                         Machines Matched  Suggestion
    ---------                         ----------------  ----------
1   ( TARGET.Cpus >= 16 )             0                 MODIFY TO 8
2   ( ( target.Ceph is true ) )       139
3   ( TARGET.Memory >= 1 )            13150
4   ( TARGET.Arch == "X86_64" )       13180
5   ( TARGET.OpSys == "LINUX" )       13180
6   ( TARGET.Disk >= 1 )              13180
7   ( TARGET.HasFileTransfer )        13180

condor_ssh_to_job
• ssh into a job from a CE

[root@arc-ce01 ~]# condor_ssh_to_job 3147487.0
Welcome to!
Your condor job is running with pid(s) 2402.
[pcms054@lcg1554 dir_2393]$
[pcms054@lcg1554 dir_2393]$

condor_fetchlog
• Retrieve log files from daemons on other machines

[root@condor01 ~]# condor_fetchlog SCHEDD
05/20/14 12:39:09 (pid:2388) ******************************************************
05/20/14 12:39:09 (pid:2388) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
05/20/14 12:39:09 (pid:2388) ** /usr/sbin/condor_schedd
05/20/14 12:39:09 (pid:2388) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
05/20/14 12:39:09 (pid:2388) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
05/20/14 12:39:09 (pid:2388) ** $CondorVersion: 8.0.6 Feb 01 2014 BuildID: 225363 $
05/20/14 12:39:09 (pid:2388) ** $CondorPlatform: x86_RedHat6 $
05/20/14 12:39:09 (pid:2388) ** PID = 2388
05/20/14 12:39:09 (pid:2388) ** Log last touched time unavailable (No such file or directory)
05/20/14 12:39:09 (pid:2388) ******************************************************
...

condor_gather_info
• Gathers information about a job
  – Including log files from schedd, startd

[root@arc-ce01 ~]# condor_gather_info --jobid 3288142.0


Job ClassAds
• Lots of useful information in job ClassAds
  – Including email address from proxy
    • Easy to contact users of problematic jobs

# condor_q 3279852.0 -autoformat

condor_chirp
• Jobs can put custom information into job ClassAds
• Example: lcmaps-plugin-condor-update
  – Puts information into job ClassAd about glexec payload user & DN
  – Can then use condor_q to see this information

Job router
• Job router daemon transforms jobs from one type to another according to configurable policies
  – E.g. submit jobs to a different batch system or a CE
• Example: sending excess jobs to the GridPP Cloud using glideinWMS

JOB_ROUTER_DEFAULTS = \
   [ \
      MaxIdleJobs = 10; \
      MaxJobs = 50; \
   ]
JOB_ROUTER_ENTRIES = \
   [ \
      Requirements = true; \
      GridResource = "condor"; \
      name = "GridPP_Cloud"; \
   ]
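How the two limits in JOB_ROUTER_DEFAULTS might bound routing can be sketched as follows. This reflects one reading of the settings (MaxJobs caps the total number of routed jobs, MaxIdleJobs caps how many of them may be idle); the function name and arguments are hypothetical and the real daemon's bookkeeping may differ.

```python
def jobs_to_route(candidates, routed_total, routed_idle,
                  max_jobs=50, max_idle_jobs=10):
    """How many more candidate jobs the route would accept right now.

    Defaults mirror the MaxJobs and MaxIdleJobs values shown above.
    """
    room_total = max(0, max_jobs - routed_total)
    room_idle = max(0, max_idle_jobs - routed_idle)
    # A newly routed job starts idle, so both limits apply.
    return min(candidates, room_total, room_idle)


print(jobs_to_route(candidates=5, routed_total=0, routed_idle=0))
print(jobs_to_route(candidates=100, routed_total=48, routed_idle=3))
```

Capping idle routed jobs keeps the router from flooding the remote pool with work that it cannot start yet, while the total cap bounds overall use of the cloud resources.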

Job router
• Example: initially have 5 idle jobs

-bash-4.1$ condor_q

-- Submitter: : <> :
 ID      OWNER    SUBMITTED   RUN_TIME   ST PRI SIZE CMD
2249.0   alahiff  6/2 20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.1   alahiff  6/2 20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.2   alahiff  6/2 20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.3   alahiff  6/2 20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.4   alahiff  6/2 20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )

5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended

Job router
• Routed copies of the jobs soon appear
  – Original job mirrors the status of the routed copy

-bash-4.1$ condor_q

-- Submitter: : <> :
 ID      OWNER    SUBMITTED   RUN_TIME   ST PRI SIZE CMD
2249.0   alahiff  6/2 20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.1   alahiff  6/2 20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.2   alahiff  6/2 20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.3   alahiff  6/2 20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.4   alahiff  6/2 20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2250.0   alahiff  6/2 20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2251.0   alahiff  6/2 20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2252.0   alahiff  6/2 20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2253.0   alahiff  6/2 20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )
2254.0   alahiff  6/2 20:57   0+00:00:00 I  0   0.0  (CMSAnalysis )

10 jobs; 0 completed, 0 removed, 10 idle, 0 running, 0 held, 0 suspended

Job router
• Can check that the new jobs have been sent to a remote resource

[root@lcgvm21 ~]# condor_q -grid

-- Submitter: : <> :
 ID      OWNER    STATUS  GRID->MANAGER                HOST  GRID_JOB_ID
2250.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl        20.0
2251.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl        21.0
2252.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl        17.0
2253.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl        18.0
2254.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl        19.0

[Annotations: GRID->MANAGER is the glideinWMS HTCondor pool; GRID_JOB_ID shows the job ids on the remote resource]

Future plans
• Setting up ARC CE & some WNs using Ceph as a shared storage system (CephFS)
  – ATLAS testing with arcControlTower
    • Pulls jobs from PanDA, pushes jobs to ARC CEs
  – Input files pre-staged & cached on Ceph by ARC
  – Currently in progress…

Future plans
• Test power management features
  – HTCondor can power down idle WNs and wake them as required
• Batch system expanding into the cloud
  – Make use of idle private cloud resources
  – We have tested that condor_rooster can be used to dynamically provision VMs as they are needed


Backup slides


Overview
• Batch system monitoring
  – Mimic
  – Jobview
  – Ganglia
  – Elasticsearch

Mimic
• Overview of state of worker nodes


Ganglia
• Custom gmetric scripts + condor_gangliad
• Aim to move to using metrics only from condor_gangliad as much as possible (easy to share with other sites)

Jobs monitoring
• CASTOR team at RAL have been testing Elasticsearch
  – Why not try using it with HTCondor?
• Elasticsearch ELK stack
  – Logstash: parses log files
  – Elasticsearch: search & analyze data in real-time
  – Kibana: data visualization
• Hardware setup
  – Test cluster of 13 servers (old diskservers & worker nodes)
    • But 3 servers could handle 16 GB of CASTOR logs per day
• Adding HTCondor
  – Early testing phase only
  – Wrote config file for Logstash to enable history files to be parsed
  – Add Logstash to machines running schedds

Jobs monitoring
• Full job ClassAds visible & can be queried

Jobs monitoring
• Can make custom plots & dashboards