HTCondor - Useful Features
Forrest Phillips, [email protected]
August 17th, 2017
Weekly MSU ATLAS Meeting
Why do you care?
Introduction
• In this talk I will briefly introduce HTCondor, with some simple examples of how to write a condor submit script for submitting jobs.
• Then I will dive deeper into some abilities of HTCondor that I learned at the OSG User School this summer, such as ways to submit multiple jobs from one submit script.
• Finally, I will show a full example of how I used what I learned at the user school to simplify some of my submission scripts and make them more robust (especially for use with DAGMan, but that’s another talk).
Submit Scripts and Executables
• At its simplest, any condor job consists of two things: a submit script and an executable.
• The executable could be as simple as the sleep command or as complex as a shell script or framework executable.
universe = vanilla
executable = sleep

log = job.log
output = job.out
error = job.err

request_cpus = 1
request_disk = 1GB
request_memory = 1GB

queue
universe = vanilla
executable = script.sh

log = job.log
output = job.out
error = job.err

request_cpus = 1
request_disk = 1GB
request_memory = 1GB

queue
#!/bin/bash
cp -r ./code $TMPDIR/
cd $TMPDIR
setupATLAS
lsetup root
./code/exec
cp results.root /workDisk/
rm -fr ./*
Passing Arguments
• It’s also possible to pass arguments to the executable through the submit script.
universe = vanilla
executable = sleep
arguments = 60s

log = job.log
output = job.out
error = job.err

request_cpus = 1
request_disk = 1GB
request_memory = 1GB

queue
universe = vanilla
executable = script.sh
arguments = opt1 opt2

log = job.log
output = job.out
error = job.err

request_cpus = 1
request_disk = 1GB
request_memory = 1GB

queue
#!/bin/bash
cp -r ./code $TMPDIR/
cd $TMPDIR
setupATLAS
lsetup root
./code/exec --option $1
cp results.root /$2/
rm -fr ./*
universe = vanilla
executable = script.sh
arguments = $(var) opt2

log = job.log
output = job-$(process).out
error = job-$(process).err

request_cpus = 1
request_disk = 1GB
request_memory = 1GB

queue var in (
    thing1
    thing2
    thing3
    thing4
    #thing5
    thing6
    thing7
    thing8
    thing9
    thing10
)
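Since condor hands the arguments string to the executable as ordinary positional parameters, you can sanity-check a wrapper script locally before submitting. A minimal sketch (the wrapper name and the option/output names are made up for illustration, not from the talk):

```shell
#!/bin/bash
# Write a toy wrapper that echoes its positional parameters, the same way
# a condor executable would receive them from `arguments = opt1 opt2`.
cat > /tmp/toy-wrapper.sh <<'EOF'
#!/bin/bash
echo "option=$1 outdir=$2"
EOF
chmod +x /tmp/toy-wrapper.sh

# Run it by hand with the same values the submit script would pass.
/tmp/toy-wrapper.sh opt1 opt2   # prints: option=opt1 outdir=opt2
```

If the local run does what you expect, the worker node will see exactly the same invocation.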
Submitting Multiple Jobs
• If you need to submit multiple jobs to condor, there are two ways to do so.
• Create a script that makes many executables and many submit scripts and then submits each job to condor.
• Or, have one generalized executable and one submit script that submits multiple jobs (recommended for use with DAGMan, and also simpler in my opinion).
#!/bin/bash
cp -r ./code $TMPDIR/
cd $TMPDIR
setupATLAS
lsetup root
./code --option=$1
cp results.root /$2/
Different Ways to Submit Many Jobs
• There are several ways to submit multiple jobs:
• Using “queue <n>” will create n identical jobs (not really what most of us want).
• Using the “in” keyword will create a job for each element in an array. Supports multiple variables.
• Using the “matching” keyword will create a job for each file or directory matching the glob.
• Using the “from” keyword will create a job for each comma-separated row in the given file. Supports multiple variables.
queue var in (
    thing1
    thing2
    thing3
    thing4
    #thing5
    thing6
    thing7
    thing8
    thing9
    thing10
)
queue var matching *.root
> ls -l ./
File1.root
File2.root
File3.root
…
File10.root
File11.txt (nice try)
File12.root
…
queue var1,var2 from list.txt
> cat list.txt
500, 0.5
500, 0.75
500, 1.0
1000, 0.5
1000, 0.75
1000, 1.0
Access the variables with $(var), $(var1), $(var2), etc.
So maybe: arguments = $(var)
Or: arguments = $(var1) $(var2)
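Putting the pieces together, a complete submit file using the “from” form might look like this (the file names and resource requests here are illustrative, not from the talk):

```
universe = vanilla
executable = script.sh
arguments = $(var1) $(var2)

log = job.log
output = job-$(process).out
error = job-$(process).err

request_cpus = 1
request_disk = 1GB
request_memory = 1GB

queue var1,var2 from list.txt
```

Each row of list.txt becomes one job, with $(var1) and $(var2) substituted into the arguments line and $(process) numbering the output files.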
Updating My Submit Scripts
• In order to use DAGMan to its full potential (which will be my next talk), I needed to update my submit scripts first.
• Originally, I had one shell script that made one executable and one submit script per mass point (of which there are 25).
• I needed to have one generalized executable and one submit script that submitted all 25 mass points.
#!/bin/bash
<setup a bunch of variables>

for i in massPoints
do
    # Make submit scripts
    echo "universe=vanilla" > job-$i.sub
    echo "executable=exec-$i.sh" >> job-$i.sub
    <several more lines to make submit script>

    # Make executables
    echo "cp -r ./code $TMPDIR/" > job-$i.sh
    echo "cd $TMPDIR" >> job-$i.sh
    <several more lines to finish making executable>

    condor_submit job-$i.sub
done
Updating My Submit Scripts
• First, I needed to figure out what this script was changing each time it made these executables and submit scripts.
• I knew these would need to be options that go into the “arguments” list of the condor submit script.
• These would also need to be variables inside of my generalized executable.
Updating My Submit Scripts
universe = vanilla
executable = job.sh
arguments = channel signal outDir $(mass)

log = job.log
output = job-$(mass).out
error = job-$(mass).err

<request cpus, disk, and memory>

queue mass in (500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 etc.)
Before:

#!/bin/bash
<setup a bunch of variables>

for i in massPoints
do
    # Make submit scripts
    echo "universe=vanilla" > job-$i.sub
    echo "executable=exec-$i.sh" >> job-$i.sub
    <several more lines to make submit script>

    # Make executables
    echo "cp -r ./code $TMPDIR/" > job-$i.sh
    echo "cd $TMPDIR" >> job-$i.sh
    <several more lines to finish making executable>

    condor_submit job-$i.sub
done
After:

#!/bin/bash
cp -r ./code $TMPDIR/
cp config_$1_$2_$4.cfg $TMPDIR/
cd $TMPDIR/
./code config_$1_$2_$4.cfg
mv Results/ $3/
rm -fr ./*
Submitting and Checking Up on Your Jobs/Nodes
condor_submit and condor_q
• To actually submit your jobs, use the command “condor_submit <submit script>”.
• To see how your jobs are doing, use the command “condor_q <username>”.
• There are several statuses your job can have: idle (I), running (R), held (H), completed (C), or removed (X). (Technically on other condor systems there are more, but I’ve never seen them at MSU.)
• If you have a held job and would like to see why, you can use the command “condor_q -hold <jobID>” (adding “-af HoldReason” prints just the hold reason).
• If we update condor to the latest version in the future, condor_q will only show your jobs and will collapse ones with the same clusterID into one entry. You can use the -all option to show everyone’s jobs and the -nobatch option to show each job separately.
$ condor_submit job.submit
Submitting job(s).
1 job(s) submitted to cluster 128.
$ condor_q
ID      OWNER     SUBMITTED   RUN_TIME    ST  PRI  SIZE  CMD
128.0   forrestp  5/9  11:09  0+00:00:00  I   0    0.0   compare_states wi.dat us.dat
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
$ condor_q
OWNER  BATCH_NAME            SUBMITTED  DONE  RUN  IDLE  TOTAL  JOB_IDS
alice  CMD: compare_states   5/9 11:05  _     _    1     1      128.0
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Checking Job Failure
• If a job fails or is taking a long time, there are three places to start your search: the condor log file, the condor output file, and the condor error file.
• The log file keeps information about disk usage and memory usage as a function of time. This is a good place to look if you think your code might have a memory leak.
• The output file keeps all the standard output that would normally be printed to the terminal.
• The error file keeps all the standard error that would normally be printed to the terminal.
• I’m not actually sure that the tier3 is configured to allow this, but condor has a command, “condor_ssh_to_job <JobID>”, that you can use to ssh into a node that is currently running your job.
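As an example of digging through the log file, the memory and image-size updates can be pulled out with grep. Below I fake a tiny job.log in the event-log style condor writes just to show the idea (the file path and all the values are made up; on the tier3 you would point grep at your real log):

```shell
#!/bin/bash
# Toy event log with the kind of update records condor writes (illustrative values).
cat > /tmp/job.log <<'EOF'
000 (128.000.000) 05/09 11:09:10 Job submitted from host: <192.168.0.1:9618>
...
006 (128.000.000) 05/09 11:12:03 Image size of job updated: 1048576
	512  -  MemoryUsage of job (MB)
	524288  -  ResidentSetSize of job (KB)
...
EOF

# Pull out the memory-related update lines to watch usage over time.
grep -E "Image size|MemoryUsage" /tmp/job.log
```

A steadily growing image size across these update records is the classic memory-leak signature mentioned above.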
Holding, Editing, and Releasing Jobs
• Say you submit a bunch of jobs and realize that you didn’t request the right amount of memory, or forgot to add concurrency limits, or something of a similar nature.
• You could hold the jobs with “condor_hold <JobID>”, edit the relevant info with “condor_qedit <JobID> RequestMemory 1024”, and then release it with “condor_release <JobID>”.
• If you find one of your jobs was held by condor, you could figure out what happened and then use the last two steps from above.
• If you decide the job is just ruined beyond repair, you can remove it with “condor_rm <JobID>”.
• Similarly, you can remove or hold a whole batch of jobs with the same clusterID with “condor_rm <ClusterID>” or “condor_hold <ClusterID>”.
Checking the Status of Nodes
• You can use the command “condor_status” to see how many nodes there are in total, how many are being used, and how many are free. You can also get a lot more information about each individual node with “condor_status -l <machine address>”.
forrestp@maron:~$ condor_status
Name                    OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime

[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   1024  0+00:36:48
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   1024  0+00:16:09
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   1024  0+00:43:44
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   1024  0+00:19:30
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   1024  0+00:17:47
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.040   4096  1+21:37:50
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   4096  0+00:29:38
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   4096  0+00:41:26
[email protected]  LINUX  X86_64  Claimed    Busy      1.000   1024  1+21:34:55
[email protected]  LINUX  X86_64  Claimed    Busy      1.000   1024  1+21:34:55
[email protected]  LINUX  X86_64  Claimed    Busy      1.000   1024  1+21:34:55
[email protected]  LINUX  X86_64  Claimed    Busy      1.000   1024  1+21:34:55
[email protected]  LINUX  X86_64  Claimed    Busy      1.000   1024  1+21:34:55
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.140   4096  1+21:35:22
[email protected]  LINUX  X86_64  Claimed    Busy      1.000   4096  1+22:12:00
. . .
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   4096  1+21:38:48
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   4096  0+07:45:34
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   4096  8+17:25:01
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   4096  16+23:13:48

              Machines  Owner  Claimed  Unclaimed  Matched  Preempting

X86_64/LINUX       476      0       80        396        0           0

       Total       476      0       80        396        0           0
Viewing Job History
• You can also use “condor_history” if you want to see info about jobs that have completed running. (Sorry Kuan)
forrestp@maron:~$ condor_history
ID        OWNER     SUBMITTED   RUN_TIME    ST  COMPLETED   CMD
767325.0  linkuany  8/15 19:47  0+00:01:50  C   8/15 19:50  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad5000.sh
767323.0  linkuany  8/15 19:47  0+00:01:29  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad4000.sh
767322.0  linkuany  8/15 19:47  0+00:01:18  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad3500.sh
767324.0  linkuany  8/15 19:47  0+00:01:10  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad4500.sh
767318.0  linkuany  8/15 19:47  0+00:01:27  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad2250.sh
767320.0  linkuany  8/15 19:47  0+00:01:21  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad2750.sh
767321.0  linkuany  8/15 19:47  0+00:01:19  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad3000.sh
767316.0  linkuany  8/15 19:47  0+00:01:12  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad1750.sh
767315.0  linkuany  8/15 19:47  0+00:01:12  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad1500.sh
767317.0  linkuany  8/15 19:47  0+00:01:08  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad2000.sh
767319.0  linkuany  8/15 19:47  0+00:01:07  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad2500.sh
767314.0  linkuany  8/15 19:47  0+00:01:05  C   8/15 19:48  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad1250.sh
767313.0  linkuany  8/15 19:47  0+00:00:59  C   8/15 19:48  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad1000.sh
767285.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad4500.sh
767293.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad2500.sh
767301.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad1250.sh
767309.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad3500.sh
767284.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad4000.sh
767292.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad2250.sh
767300.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad1000.sh
767308.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad3000.sh
767261.0  linkuany  8/15 19:46  0+00:00:20  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad1000.sh
767269.0  linkuany  8/15 19:46  0+00:00:20  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad3000.sh
767283.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad3500.sh
767291.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad2000.sh
767299.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad5000.sh
767307.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad2750.sh
767282.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad3000.sh
767290.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad1750.sh
767298.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad4500.sh
Interactive Jobs
• You can submit a job interactively with “condor_submit -interactive <submit script>”.
• This will put you on a worker node and leave all of the script execution to you.
Summary
• Condor can make your life even easier than it already does.
• It can submit multiple jobs for you if you let it.
• There are lots of different commands for checking on the status of your jobs and of the nodes.
• Next week or the week after, I will give a talk on using DAGMan, a condor tool that allows you to submit multiple steps all at once. Say you have two submit scripts (A and B) and A needs to run before B starts; DAGMan can take care of that for you (as well as any scripts that need to run before or after each set of jobs).
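As a teaser for that talk, the DAGMan input file for the two-script case is only a few lines (the .dag and .sub file names here are placeholders):

```
# my.dag: run job A to completion before starting job B
JOB A a.sub
JOB B b.sub
PARENT A CHILD B
```

You would then submit the whole workflow at once with “condor_submit_dag my.dag”, and DAGMan takes care of submitting B only after A finishes successfully.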