HTCondor - Useful Features
Forrest Phillips, [email protected]
August 17th, 2017
Weekly MSU ATLAS Meeting
Why do you care?
Introduction
• In this talk I will briefly introduce HTCondor, with some simple examples of how to write a condor submit script for submitting jobs.
• Then I will dive deeper into some abilities of HTCondor that I learned at the OSG User School this summer, such as ways to submit multiple jobs from one submit script.
• Finally, I will show a full example of how I used what I learned at the user school to simplify some of my submission scripts and make them more robust (especially for use with DAGMan, but that’s another talk).
Submit Scripts and Executables
• At its simplest, any condor job consists of two things: a submit script and an executable.
• The executable could be as simple as the sleep command or as complex as a shell script or framework executable.
universe = vanilla
executable = sleep

log = job.log
output = job.out
error = job.err

request_cpus = 1
request_disk = 1GB
request_memory = 1GB

queue
universe = vanilla
executable = script.sh

log = job.log
output = job.out
error = job.err

request_cpus = 1
request_disk = 1GB
request_memory = 1GB

queue
#!/bin/bash
cp -r ./code $TMPDIR/
cd $TMPDIR
setupATLAS
lsetup root
./code/exec
cp results.root /workDisk/
rm -fr ./*
Passing Arguments
• It’s also possible to pass arguments to the executable through the submit script.
universe = vanilla
executable = sleep
arguments = 60s

log = job.log
output = job.out
error = job.err

request_cpus = 1
request_disk = 1GB
request_memory = 1GB

queue
universe = vanilla
executable = script.sh
arguments = opt1 opt2

log = job.log
output = job.out
error = job.err

request_cpus = 1
request_disk = 1GB
request_memory = 1GB

queue
#!/bin/bash
cp -r ./code $TMPDIR/
cd $TMPDIR
setupATLAS
lsetup root
./code/exec --option $1
cp results.root /$2/
rm -fr ./*
universe = vanilla
executable = script.sh
arguments = $(var) opt2

log = job.log
output = job-$(process).out
error = job-$(process).err

request_cpus = 1
request_disk = 1GB
request_memory = 1GB

queue var in (
    thing1
    thing2
    thing3
    thing4
    #thing5
    thing6
    thing7
    thing8
    thing9
    thing10
)
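Since condor hands the arguments string to the executable as ordinary positional parameters, you can sanity-check a wrapper script locally before submitting. A minimal sketch (the wrapper name and the option/output names are made up for illustration, not from the talk):

```shell
#!/bin/bash
# Write a toy wrapper that echoes its positional parameters, the same way
# a condor executable would receive them from `arguments = opt1 opt2`.
cat > /tmp/toy-wrapper.sh <<'EOF'
#!/bin/bash
echo "option=$1 outdir=$2"
EOF
chmod +x /tmp/toy-wrapper.sh

# Run it by hand with the same values the submit script would pass.
/tmp/toy-wrapper.sh opt1 opt2   # prints: option=opt1 outdir=opt2
```

If the local run does what you expect, the worker node will see exactly the same invocation.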
Submitting Multiple Jobs
• If you need to submit multiple jobs to condor, there are two ways to do so.
• Create a script that makes many executables and many submit scripts and then submits each job to condor.
• Or, have one generalized executable and one submit script that submits multiple jobs (recommended for use with DAGMan, and also simpler in my opinion).
#!/bin/bash
cp -r ./code $TMPDIR/
cd $TMPDIR
setupATLAS
lsetup root
./code --option=$1
cp results.root /$2/
Different Ways to Submit Many Jobs
• There are several ways to submit multiple jobs:
• Using “queue <n>” will create n identical jobs (not really what most of us want).
• Using the “in” keyword will create a job for each element in an array. Supports multiple variables.
• Using the “matching” keyword will create a job for each file or directory matching the glob.
• Using the “from” keyword will create a job for each comma-separated row in the given file. Supports multiple variables.
queue var in (
    thing1
    thing2
    thing3
    thing4
    #thing5
    thing6
    thing7
    thing8
    thing9
    thing10
)
queue var matching *.root
> ls -l ./
File1.root
File2.root
File3.root
…
File10.root
File11.txt (nice try)
File12.root
…
queue var1,var2 from list.txt
> cat list.txt
500, 0.5
500, 0.75
500, 1.0
1000, 0.5
1000, 0.75
1000, 1.0
Access the variables with $(var), $(var1), $(var2), etc.
So maybe: arguments = $(var)
Or: arguments = $(var1) $(var2)
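Putting the pieces together, a complete submit file using the “from” form might look like this (the file names and resource requests here are illustrative, not from the talk):

```
universe = vanilla
executable = script.sh
arguments = $(var1) $(var2)

log = job.log
output = job-$(process).out
error = job-$(process).err

request_cpus = 1
request_disk = 1GB
request_memory = 1GB

queue var1,var2 from list.txt
```

Each row of list.txt becomes one job, with $(var1) and $(var2) substituted into the arguments line and $(process) numbering the output files.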
Updating My Submit Scripts
• In order to use DAGMan to its full potential (which will be my next talk), I needed to update my submit scripts first.
• Originally, I had one shell script that made one executable and one submit script per mass point (of which there are 25).
• I needed to have one generalized executable and one submit script that submitted all 25 mass points.
#!/bin/bash
<setup a bunch of variables>

for i in massPoints
do
    # Make submit scripts
    echo "universe=vanilla" > job-$i.sub
    echo "executable=exec-$i.sh" >> job-$i.sub
    <several more lines to make submit script>

    # Make executables
    echo "cp -r ./code $TMPDIR/" > job-$i.sh
    echo "cd $TMPDIR" >> job-$i.sh
    <several more lines to finish making executable>

    condor_submit job-$i.sub
done
Updating My Submit Scripts
• First, I needed to figure out what this script was changing each time it made these executables and submit scripts.
• I knew these would need to be options that go into the “arguments” list of the condor submit script.
• These would also need to be variables inside of my generalized executable.
Updating My Submit Scripts
universe = vanilla
executable = job.sh
arguments = channel signal outDir $(mass)

log = job.log
output = job-$(mass).out
error = job-$(mass).err

<request cpus, disk, and memory>

queue mass in (500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 etc.)
Before:

#!/bin/bash
<setup a bunch of variables>

for i in massPoints
do
    # Make submit scripts
    echo "universe=vanilla" > job-$i.sub
    echo "executable=exec-$i.sh" >> job-$i.sub
    <several more lines to make submit script>

    # Make executables
    echo "cp -r ./code $TMPDIR/" > job-$i.sh
    echo "cd $TMPDIR" >> job-$i.sh
    <several more lines to finish making executable>

    condor_submit job-$i.sub
done
After:

#!/bin/bash
cp -r ./code $TMPDIR/
cp config_$1_$2_$4.cfg $TMPDIR/
cd $TMPDIR/
./code config_$1_$2_$4.cfg
mv Results/ $3/
rm -fr ./*
Submitting and Checking Up on Your Jobs/Nodes
condor_submit and condor_q
• To actually submit your jobs, use the command “condor_submit <submit script>”.
• To see how your jobs are doing, use the command “condor_q <username>”.
• There are several statuses your job can have: idle (I), running (R), held (H), completed (C), or removed (X). (Technically on other condor systems there are more, but I’ve never seen them at MSU.)
• If you have a held job and would like to see why, you can use the command “condor_q -hold <jobID>” (adding “-af HoldReason” prints just the hold reason).
• If we update condor to the latest version in the future, condor_q will only show your jobs and will collapse ones with the same clusterID into one entry. You can use the -all option to show everyone’s jobs and the -nobatch option to show each job separately.
$ condor_submit job.submit
Submitting job(s).
1 job(s) submitted to cluster 128.
$ condor_q
ID      OWNER     SUBMITTED   RUN_TIME    ST  PRI  SIZE  CMD
128.0   forrestp  5/9  11:09  0+00:00:00  I   0    0.0   compare_states wi.dat us.dat
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
$ condor_q
OWNER  BATCH_NAME            SUBMITTED  DONE  RUN  IDLE  TOTAL  JOB_IDS
alice  CMD: compare_states   5/9 11:05  _     _    1     1      128.0
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Checking Job Failure
• If a job fails or is taking a long time, there are three places to start your search: the condor log file, the condor output file, and the condor error file.
• The log file keeps information about disk usage and memory usage as a function of time. This is a good place to look if you think your code might have a memory leak.
• The output file keeps all the standard output that would normally be printed to the terminal.
• The error file keeps all the standard error that would normally be printed to the terminal.
• I’m not actually sure that the tier3 is configured to allow this, but condor has a command, “condor_ssh_to_job <JobID>”, that you can use to ssh into a node that is currently running your job.
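As an example of digging through the log file, the memory and image-size updates can be pulled out with grep. Below I fake a tiny job.log in the event-log style condor writes just to show the idea (the file path and all the values are made up; on the tier3 you would point grep at your real log):

```shell
#!/bin/bash
# Toy event log with the kind of update records condor writes (illustrative values).
cat > /tmp/job.log <<'EOF'
000 (128.000.000) 05/09 11:09:10 Job submitted from host: <192.168.0.1:9618>
...
006 (128.000.000) 05/09 11:12:03 Image size of job updated: 1048576
	512  -  MemoryUsage of job (MB)
	524288  -  ResidentSetSize of job (KB)
...
EOF

# Pull out the memory-related update lines to watch usage over time.
grep -E "Image size|MemoryUsage" /tmp/job.log
```

A steadily growing image size across these update records is the classic memory-leak signature mentioned above.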
Holding, Editing, and Releasing Jobs
• Say you submit a bunch of jobs and realize that you didn’t request the right amount of memory, or forgot to add concurrency limits, or something of a similar nature.
• You could hold the jobs with “condor_hold <JobID>”, edit the relevant info with “condor_qedit <JobID> RequestMemory 1024”, and then release it with “condor_release <JobID>”.
• If you find one of your jobs was held by condor, you could figure out what happened and then use the last two steps from above.
• If you decide the job is just ruined beyond repair, you can remove it with “condor_rm <JobID>”.
• Similarly, you can remove or hold a whole batch of jobs with the same clusterID with “condor_rm <ClusterID>” or “condor_hold <ClusterID>”.
Checking the Status of Nodes
• You can use the command “condor_status” to see how many nodes there are in total, how many are being used, and how many are free. You can also get a lot more information about each individual node with “condor_status -l <machine address>”.
forrestp@maron:~$ condor_status
Name                    OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime

[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   1024  0+00:36:48
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   1024  0+00:16:09
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   1024  0+00:43:44
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   1024  0+00:19:30
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   1024  0+00:17:47
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.040   4096  1+21:37:50
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   4096  0+00:29:38
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   4096  0+00:41:26
[email protected]  LINUX  X86_64  Claimed    Busy      1.000   1024  1+21:34:55
[email protected]  LINUX  X86_64  Claimed    Busy      1.000   1024  1+21:34:55
[email protected]  LINUX  X86_64  Claimed    Busy      1.000   1024  1+21:34:55
[email protected]  LINUX  X86_64  Claimed    Busy      1.000   1024  1+21:34:55
[email protected]  LINUX  X86_64  Claimed    Busy      1.000   1024  1+21:34:55
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.140   4096  1+21:35:22
[email protected]  LINUX  X86_64  Claimed    Busy      1.000   4096  1+22:12:00
. . .
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   4096  1+21:38:48
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   4096  0+07:45:34
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   4096  8+17:25:01
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   4096  16+23:13:48

              Machines  Owner  Claimed  Unclaimed  Matched  Preempting

X86_64/LINUX       476      0       80        396        0           0

       Total       476      0       80        396        0           0
Viewing Job History
• You can also use “condor_history” if you want to see info about jobs that have completed running. (Sorry Kuan)
forrestp@maron:~$ condor_history
ID        OWNER     SUBMITTED   RUN_TIME    ST  COMPLETED   CMD
767325.0  linkuany  8/15 19:47  0+00:01:50  C   8/15 19:50  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad5000.sh
767323.0  linkuany  8/15 19:47  0+00:01:29  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad4000.sh
767322.0  linkuany  8/15 19:47  0+00:01:18  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad3500.sh
767324.0  linkuany  8/15 19:47  0+00:01:10  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad4500.sh
767318.0  linkuany  8/15 19:47  0+00:01:27  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad2250.sh
767320.0  linkuany  8/15 19:47  0+00:01:21  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad2750.sh
767321.0  linkuany  8/15 19:47  0+00:01:19  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad3000.sh
767316.0  linkuany  8/15 19:47  0+00:01:12  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad1750.sh
767315.0  linkuany  8/15 19:47  0+00:01:12  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad1500.sh
767317.0  linkuany  8/15 19:47  0+00:01:08  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad2000.sh
767319.0  linkuany  8/15 19:47  0+00:01:07  C   8/15 19:49  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad2500.sh
767314.0  linkuany  8/15 19:47  0+00:01:05  C   8/15 19:48  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad1250.sh
767313.0  linkuany  8/15 19:47  0+00:00:59  C   8/15 19:48  /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad1000.sh
767285.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad4500.sh
767293.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad2500.sh
767301.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad1250.sh
767309.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad3500.sh
767284.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad4000.sh
767292.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad2250.sh
767300.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad1000.sh
767308.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad3000.sh
767261.0  linkuany  8/15 19:46  0+00:00:20  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad1000.sh
767269.0  linkuany  8/15 19:46  0+00:00:20  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad3000.sh
767283.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad3500.sh
767291.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad2000.sh
767299.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad5000.sh
767307.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad2750.sh
767282.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad3000.sh
767290.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad1750.sh
767298.0  linkuany  8/15 19:46  0+00:00:00  X   ???         /home/linkuany/t3work9/outputs_limits/R1508/info/limits_wpr_tbhad4500.sh
Interactive Jobs
• You can submit a job interactively with “condor_submit -interactive <submit script>”.
• This will put you on a worker node and leave all of the script execution to you.
Summary
• Condor can make your life even easier than it already does.
• It can submit multiple jobs for you if you let it.
• There are lots of different commands for checking on the status of your jobs and of the nodes.
• Next week or the week after, I will give a talk on using DAGMan, a condor tool that allows you to submit multiple steps all at once. Say you have two submit scripts (A and B) and A needs to run before B starts; DAGMan can take care of that for you (as well as any scripts that need to run before or after each set of jobs).
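As a teaser for that talk, the DAGMan input file for the two-script case is only a few lines (the .dag and .sub file names here are placeholders):

```
# my.dag: run job A to completion before starting job B
JOB A a.sub
JOB B b.sub
PARENT A CHILD B
```

You would then submit the whole workflow at once with “condor_submit_dag my.dag”, and DAGMan takes care of submitting B only after A finishes successfully.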