Introduction to Condor DMD/DFS J.Knudstrup December 2005.


Page 1: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Introduction to Condor

DMD/DFS J.Knudstrup

December 2005

Page 2: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Motivation

• Need for a system to harvest unused CPU cycles and other resources in a network.

Page 3: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

What is Condor?

• Full-featured batch queue system.
• Condor is a specialized workload management system for compute-intensive jobs.
• Condor provides a job queuing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management.
• Users submit their serial or parallel jobs to Condor.
• Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.
• Can be used to manage a cluster of dedicated compute nodes.
• In addition, unique mechanisms enable Condor to effectively harvest wasted CPU power from otherwise idle desktop workstations.
• Condor does not require a shared file system across machines: if no shared file system is available, Condor can transfer the job's data files on behalf of the user.
• Condor can be used to seamlessly combine all of an organization's computational power into one resource.

Page 4: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

History of Condor

• Hosted at the University of Wisconsin, USA.
• Condor project started in 1988.
• Directed by Professor M. Livny.
• Preliminary version of the Condor Resource Management system implemented in 1986.
• Originally focused on the problem of load balancing in a distributed system.
• Later shifted its attention to distributively owned computing environments, where owners have full control over the resources they own.

Page 5: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Status

• ~17 years of development.
• Condor team consists of ~30 people.
• Available on many platforms.
• Basic installation and usage very easy.
• Contracted + free support.
• Used in research environments and by industry.
• Sponsored by various major IT companies and organizations (IBM, Intel, Microsoft, NASA, …).

Page 6: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Architecture

• Coordinated by a Central Manager node.
• No central DBMS.
• Condor provides a set of daemons defining the roles of each node in the pool.
• Daemons:
  – condor_master: Basic coordination on each node.
  – condor_collector: Collects system information. Only on Central Manager.
  – condor_negotiator: Assigns jobs to machines. Only on Central Manager.
  – condor_startd: Executes jobs.
  – condor_schedd: Handles job submission.

Page 7: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Condor Pool - Example

[Diagram: example Condor pool. Legend: ClassAd communication pathway; process spawned.]

• Central Manager: master, collector, negotiator, schedd, startd
• Submit-Only node: master, schedd
• Execute-Only nodes (2): master, startd
• Regular nodes (2): master, schedd, startd
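The role-to-daemon mapping in the pool example can be written down as a small lookup table. This is only an illustrative sketch: the role keys (`central_manager`, `submit_only`, ...) and helper names are hypothetical, and the comma-separated upper-case output merely imitates the style of a daemon list setting in the Condor configuration.

```python
# Daemons running on each kind of node, as listed on the pool example slide.
# The role keys are hypothetical labels chosen for this sketch.
ROLE_DAEMONS = {
    "central_manager": {"master", "collector", "negotiator", "schedd", "startd"},
    "submit_only": {"master", "schedd"},
    "execute_only": {"master", "startd"},
    "regular": {"master", "schedd", "startd"},
}

# Fixed display order so the output is deterministic.
DISPLAY_ORDER = ["master", "collector", "negotiator", "schedd", "startd"]


def can_submit(role):
    """A node accepts job submissions only if it runs a schedd."""
    return "schedd" in ROLE_DAEMONS[role]


def can_execute(role):
    """A node executes jobs only if it runs a startd."""
    return "startd" in ROLE_DAEMONS[role]


def daemon_list(role):
    """Return the daemons of a role as a comma-separated, upper-case string."""
    return ", ".join(d.upper() for d in DISPLAY_ORDER if d in ROLE_DAEMONS[role])


if __name__ == "__main__":
    for role in ROLE_DAEMONS:
        print(f"{role}: {daemon_list(role)}")
```

Every role includes the master, since it provides the basic coordination on each node.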

Page 8: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Personal Condor vs. Condor Pool

• Condor Pool:
  – Collection of several nodes coordinated by one Central Manager.
• Personal Condor:
  – Condor on one workstation, no root access required, no system administrator intervention needed.
  – Benefits of a 'pool' with only one node (same as for a pool):
    • Schedule large batches of jobs and have these processed in the background.
    • Keep an eye on jobs and get progress updates.
    • Implement own scheduling policies on the execution order of jobs.
    • Keep a log of the job activities.
    • Add fault tolerance to the job execution.
    • Implement policies for when jobs can run on a workstation.

Page 9: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Dedicated Nodes vs. Non-Dedicated Nodes

• Dedicated Node:
  – Condor has all CPUs at its disposal.
• Non-Dedicated Node:
  – Can't always run Condor jobs.
  – If a user is accessing the keyboard/mouse or the CPU is used by other processes, the Condor jobs are preempted.
  – The policies for when Condor jobs can be started and when they may be preempted are defined in the Condor configuration.
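The start/preempt behaviour of a non-dedicated node can be illustrated with two predicates over the machine state. This is a sketch of the idea only: in a real pool the policy is written as expressions in the Condor configuration, not in Python, and the thresholds and field names below are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class MachineState:
    keyboard_idle_s: float  # seconds since the last keyboard/mouse activity
    non_condor_load: float  # load average caused by non-Condor processes


# Hypothetical thresholds, not Condor defaults.
MIN_KEYBOARD_IDLE_S = 15 * 60   # owner must be away for 15 minutes
MAX_NON_CONDOR_LOAD = 0.3       # machine must be nearly idle
PREEMPT_KEYBOARD_IDLE_S = 60    # owner is considered "back" below this


def may_start(machine):
    """A Condor job may start only when the owner is away and the CPU is free."""
    return (machine.keyboard_idle_s >= MIN_KEYBOARD_IDLE_S
            and machine.non_condor_load <= MAX_NON_CONDOR_LOAD)


def must_preempt(machine):
    """A running Condor job is preempted when the owner returns or the CPU gets busy."""
    return (machine.keyboard_idle_s < PREEMPT_KEYBOARD_IDLE_S
            or machine.non_condor_load > MAX_NON_CONDOR_LOAD)


if __name__ == "__main__":
    away = MachineState(keyboard_idle_s=3600, non_condor_load=0.05)
    typing = MachineState(keyboard_idle_s=2, non_condor_load=0.05)
    print(may_start(away), must_preempt(away))      # owner away, CPU free
    print(may_start(typing), must_preempt(typing))  # owner at the keyboard
```

Using a stricter condition for starting than for preempting avoids jobs being started and evicted in quick succession.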

Page 10: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Shared File System vs. File Distribution

• Use a shared filesystem if available:
  – Administration and handling easier.
  – Normally the case for a dedicated cluster.
• If no shared filesystem?
  – Condor can transfer files.
  – Can automatically send back changed files.
  – Atomic transfer of multiple files.
  – Data can be encrypted during transfer.
  – Usually the case for pools with non-dedicated nodes or in a GRID environment.

Page 11: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Personal Condor - Condor Pool - Condor Flocking

[Diagram showing: Personal Condor (2), Submission Node, Dedicated Pool, Common User Desktop Pool.]

Page 12: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Condor Configuration

• Simple format (based on ClassAd).
• Possible to use environment variables.
• Global configuration and local configuration specific to each node.
• Large set of configurable parameters.
• Example:

CONDOR_HOST = dfo09.hq.eso.org
RELEASE_DIR = /home/condor/INSTROOT/
LOCAL_DIR = $(TILDE)
LOCAL_CONFIG_FILE = $(LOCAL_DIR)/condor_config.local
REQUIRE_LOCAL_CONFIG_FILE = TRUE
CONDOR_ADMIN = [email protected]
MAIL = /bin/mail
UID_DOMAIN = hq.eso.org
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
…
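The `$(NAME)` references in the example are macros that refer to other parameters. A minimal sketch of how such references resolve is given below; it handles only plain `NAME = value` lines and `$(NAME)` substitution, not the full Condor configuration syntax. The function names and the value supplied for `TILDE` are assumptions for illustration.

```python
import re

MACRO = re.compile(r"\$\((\w+)\)")


def parse_config(text):
    """Collect NAME = value pairs, skipping blank lines and '#' comments."""
    raw = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            raw[name.strip()] = value.strip()
    return raw


def expand(name, raw, builtins, _seen=()):
    """Resolve a parameter, substituting $(OTHER) references recursively."""
    if name in _seen:
        raise ValueError(f"circular macro reference: {name}")
    value = raw.get(name, builtins.get(name))
    if value is None:
        raise KeyError(name)
    return MACRO.sub(
        lambda m: expand(m.group(1), raw, builtins, _seen + (name,)), value
    )


SAMPLE = """\
LOCAL_DIR = $(TILDE)
LOCAL_CONFIG_FILE = $(LOCAL_DIR)/condor_config.local
"""

if __name__ == "__main__":
    config = parse_config(SAMPLE)
    # TILDE stands for the home directory of the condor user;
    # the path used here is an assumed value.
    print(expand("LOCAL_CONFIG_FILE", config, {"TILDE": "/home/condor"}))
```

With the assumed value for `TILDE`, `LOCAL_CONFIG_FILE` resolves through `LOCAL_DIR` to a path inside that home directory.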

Page 13: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Command Line Tools

• Many command line tools provided. Some of these are:
  – condor_config_val: Get/set value of configuration parameters.
  – condor_history: Query job history queue.
  – condor_off: Stop Condor daemons.
  – condor_q: Check job queue.
  – condor_reconfig: Force sourcing of configuration.
  – condor_rm: Remove jobs from the queue.
  – condor_status: Status of Condor pool.
  – condor_submit: Submit a job or a cluster of jobs.
  – condor_submit_dag: Submit a set of jobs with dependencies.
  – …

Page 14: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

condor_history

[condor@dfo09 condor]$ condor_history -l 4174
(ClusterId == 4174)
MyType = "Job"
TargetType = "Machine"
ClusterId = 4174
QDate = 1130316429
Owner = "sinfoni"
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
LastSuspensionTime = 0
CumulativeSuspensionTime = 0
CondorVersion = "$CondorVersion: 6.6.8 Jan 27 2005 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
RootDir = "/"
Iwd = "/home/condor/data/sinfoni/products/condor/dag/CALIB_2005-10-02-1130316408.83897996"
JobUniverse = 5
Cmd = "/home/sinfoni/bin/processAB"
…
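The long listing is a sequence of `Attribute = value` lines, which makes it easy to post-process. The sketch below parses a few of the lines shown above into a dictionary and converts `QDate` (seconds since the Unix epoch) into a date. The function names are hypothetical, and the parser is deliberately simple: it handles quoted strings and plain numbers only, not general ClassAd expressions.

```python
from datetime import datetime, timezone

# A few attribute lines copied from the condor_history listing above.
HISTORY_TEXT = """\
(ClusterId == 4174)
MyType = "Job"
ClusterId = 4174
QDate = 1130316429
Owner = "sinfoni"
RemoteUserCpu = 0.000000
ExitStatus = 0
JobUniverse = 5
Cmd = "/home/sinfoni/bin/processAB"
"""


def _convert(value):
    """Turn a raw attribute value into str, int, or float where possible."""
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        return value[1:-1]
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value  # leave anything else (e.g. an expression) as text


def parse_job_ad(text):
    """Parse 'Name = value' lines; lines without ' = ' (the header) are skipped."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if sep:
            ad[name.strip()] = _convert(value.strip())
    return ad


if __name__ == "__main__":
    ad = parse_job_ad(HISTORY_TEXT)
    submitted = datetime.fromtimestamp(ad["QDate"], tz=timezone.utc)
    print(ad["Owner"], ad["ClusterId"], submitted.isoformat())
```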

Page 15: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

condor_status

[Condor@ngasdev3 condor]$ condor_status

Name           OpSys  Arch   State      Activity  LoadAv  Mem  ActvtyTime

vm1@ngasdev3.  LINUX  INTEL  Owner      Idle      0.120   252  0+00:00:04
vm2@ngasdev3.  LINUX  INTEL  Unclaimed  Idle      0.000   252  0+00:20:05
vm3@ngasdev3.  LINUX  INTEL  Unclaimed  Idle      0.000   252  0+00:20:06
vm4@ngasdev3.  LINUX  INTEL  Unclaimed  Idle      0.000   252  0+00:20:07

             Machines  Owner  Claimed  Unclaimed  Matched  Preempting

INTEL/LINUX         4      1        0          3        0           0

      Total         4      1        0          3        0           0
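The per-state totals at the bottom of the listing can be recomputed from the machine lines, which is a handy cross-check when scripting around the tool. The sketch below works on the text shown above; it assumes the eight-column layout of this particular listing and is not a general parser for condor_status output. The function name is hypothetical.

```python
from collections import Counter

# Machine lines copied from the condor_status listing above.
STATUS_TEXT = """\
vm1@ngasdev3. LINUX INTEL Owner Idle 0.120 252 0+00:00:04
vm2@ngasdev3. LINUX INTEL Unclaimed Idle 0.000 252 0+00:20:05
vm3@ngasdev3. LINUX INTEL Unclaimed Idle 0.000 252 0+00:20:06
vm4@ngasdev3. LINUX INTEL Unclaimed Idle 0.000 252 0+00:20:07
"""


def count_states(text):
    """Count machines per State column (4th field of each machine line)."""
    states = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Machine lines have 8 fields and a name of the form vmN@host.
        if len(fields) == 8 and "@" in fields[0]:
            states[fields[3]] += 1
    return states


if __name__ == "__main__":
    for state, count in sorted(count_states(STATUS_TEXT).items()):
        print(f"{state}: {count}")
```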

Page 16: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

condor_q

[condor@dfo09 condor]$ condor_q

-- Submitter: dfo09.hq.eso.org : <134.171.16.145:58750> : dfo09.hq.eso.org
 ID      OWNER    SUBMITTED    RUN_TIME    ST PRI SIZE CMD
4157.0   sinfoni  10/26 10:46  0+00:34:25  R  0   3.2  condor_dagman -f -
4177.0   sinfoni  10/26 10:47  0+00:01:47  R  0   0.0  processAB -a SINFO
4178.0   sinfoni  10/26 10:47  0+00:01:26  R  0   0.0  processAB -a SINFO
4179.0   sinfoni  10/26 10:47  0+00:00:00  I  0   0.0  processAB -a SINFO
4180.0   sinfoni  10/26 10:47  0+00:00:00  I  0   0.0  processAB -a SINFO
4181.0   sinfoni  10/26 10:47  0+00:00:00  I  0   0.0  processAB -a SINFO
4182.0   sinfoni  10/26 10:47  0+00:00:00  I  0   0.0  processAB -a SINFO
4183.0   sinfoni  10/26 10:47  0+00:00:00  I  0   0.0  processAB -a SINFO
…
4201.0   sinfoni  10/26 10:47  0+00:00:00  I  0   0.0  processAB -a SINFO
4202.0   sinfoni  10/26 10:47  0+00:00:00  I  0   0.0  processAB -a SINFO
4203.0   sinfoni  10/26 10:47  0+00:00:00  I  0   0.0  processAB -a SINFO
4204.0   sinfoni  10/26 10:47  0+00:00:00  I  0   0.0  processAB -a SINFO
4205.0   sinfoni  10/26 10:47  0+00:00:00  I  0   0.0  processAB -a SINFO
4206.0   sinfoni  10/26 10:47  0+00:00:00  I  0   0.0  processAB -a SINFO
4207.0   sinfoni  10/26 10:47  0+00:00:00  I  0   0.0  processAB -a SINFO
4208.0   sinfoni  10/26 10:47  0+00:00:00  I  0   0.0  processAB -a SINFO
4209.0   sinfoni  10/26 10:47  0+00:00:00  I  0   0.0  processAB -a SINFO

34 jobs; 31 idle, 3 running, 0 held
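The summary line at the end of the queue listing is the easiest part to consume from a script, for example to decide whether a batch has finished. A minimal sketch, assuming the summary keeps exactly the wording shown above (the function name is hypothetical):

```python
import re

SUMMARY = re.compile(r"(\d+) jobs; (\d+) idle, (\d+) running, (\d+) held")


def parse_queue_summary(text):
    """Extract the job counters from the final line of a condor_q listing."""
    match = SUMMARY.search(text)
    if match is None:
        raise ValueError("no queue summary line found")
    total, idle, running, held = (int(g) for g in match.groups())
    return {"total": total, "idle": idle, "running": running, "held": held}


if __name__ == "__main__":
    summary = parse_queue_summary("34 jobs; 31 idle, 3 running, 0 held")
    print(summary)
    print("queue empty:", summary["total"] == 0)
```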

Page 17: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Requirements for a Condor Job

• Must be able to run in the background: no interactive input, windows, GUI, etc.

• Can still use STDIN, STDOUT, and STDERR, but these are redirected to files instead of the actual devices.

• Organize data files, make data available for the jobs.

Page 18: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Job Scheduling

[Diagram: job scheduling across three machines.]

• Submit Machine: Submit, Schedd, Shadow
• Execute Machine: Startd, Starter, Job
• Central Manager: Collector, Negotiator

Page 19: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Job Universes

• A universe in Condor defines an execution environment.
• The universe to use for a job is specified when the job is submitted.
• Condor provides the following universes:
  – Standard Universe: Close integration between the job and Condor. Application must be re-linked with condor_compile.
  – Vanilla Universe: Jobs executed as shell commands. Condor collects output and exit status.
  – PVM Universe: Allows programs written for the Parallel Virtual Machine interface to be used within the Condor environment.
  – MPI Universe: Allows programs written to the MPICH interface to be used within the Condor environment.
  – Globus Universe: Provides a standard Condor interface to start Globus jobs from Condor.
  – Java Universe: Natively executes jobs based on Java applications.
  – Scheduler Universe: Job does not wait to be matched with a machine; it executes right away, on the machine where the job is submitted.
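In job listings the universe appears as a number: the condor_history examples in this talk show `JobUniverse = 5` for jobs submitted with `universe = vanilla`. The lookup below is a sketch for turning such codes into names. Apart from the vanilla code, which is visible in the listings, the numbers are stated from memory of Condor's numbering and should be treated as assumptions to verify against the installed version.

```python
# Numeric JobUniverse codes. Only 5 = vanilla is confirmed by the listings
# in this talk; the other codes are assumptions to be checked.
UNIVERSE_NAMES = {
    1: "standard",
    4: "pvm",
    5: "vanilla",
    7: "scheduler",
    8: "mpi",
    9: "globus",
    10: "java",
}


def universe_name(code):
    """Translate a JobUniverse number into a readable name."""
    return UNIVERSE_NAMES.get(code, f"unknown ({code})")


if __name__ == "__main__":
    print(universe_name(5))
    print(universe_name(99))
```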

Page 20: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Example Simple Job Submission

$ more ~/tmp/Job1.cmd
universe = vanilla
executable = /bin/sleep
output = /home/condor/tmp/Job1.out
error = /home/condor/tmp/Job1.err
log = /home/condor/tmp/Job1.log
arguments = 5
#requirements = (use default requirements)
should_transfer_files = NO
notification = NEVER
queue
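When many similar jobs are needed, the submit file can be generated rather than written by hand. The following sketch renders the `name = value` lines of the example above from a dictionary; the function name `render_submit` is hypothetical, and the resulting text would still have to be saved to a file and passed to condor_submit.

```python
# Settings copied from the Job1.cmd example above.
JOB1 = {
    "universe": "vanilla",
    "executable": "/bin/sleep",
    "output": "/home/condor/tmp/Job1.out",
    "error": "/home/condor/tmp/Job1.err",
    "log": "/home/condor/tmp/Job1.log",
    "arguments": "5",
    "should_transfer_files": "NO",
    "notification": "NEVER",
}


def render_submit(commands, queue_count=1):
    """Render a submit description: one 'name = value' line per command,
    followed by the closing queue statement."""
    lines = [f"{name} = {value}" for name, value in commands.items()]
    lines.append("queue" if queue_count == 1 else f"queue {queue_count}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_submit(JOB1), end="")
```

The queue statement is emitted last because it queues the job with the settings defined above it.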

Page 21: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Job Monitoring

• While the job is running:

$ condor_q

-- Submitter: ngasdev3.hq.eso.org : <134.171.21.32:35346> : ngasdev3.hq.eso.org
 ID      OWNER   SUBMITTED    RUN_TIME    ST PRI SIZE CMD
11848.0  condor  11/17 09:17  0+00:00:03  R  0   0.0  sleep 5

1 jobs; 0 idle, 1 running, 0 held

• After job completion (no other jobs running):

$ condor_q

-- Submitter: ngasdev3.hq.eso.org : <134.171.21.32:35346> : ngasdev3.hq.eso.org
 ID      OWNER   SUBMITTED    RUN_TIME    ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

Page 22: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Job History

• Query historical info about a job after it has terminated:

$ condor_history -l 11848|more
(ClusterId == 11848)
MyType = "Job"
TargetType = "Machine"
ClusterId = 11848
QDate = 1132219077
Owner = "condor"
ExitStatus = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
CondorVersion = "$CondorVersion: 6.6.8 Jan 27 2005 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
RootDir = "/"
Iwd = "/diska/home/condor/tmp"
JobUniverse = 5
…

Page 23: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Jobs and Resources

• A job may require a certain amount of memory and disk space to run; this is expressed with Requirements.
• Rank expresses a preference among the machines that satisfy the requirements: the higher the Rank, the better the match.

$ more ~/tmp/Job-reqs-ex.cmd
universe = vanilla
executable = /bin/sleep
output = /home/condor/tmp/Job-reqs-ex.out
error = /home/condor/tmp/Job-reqs-ex.err
log = /home/condor/tmp/Job-reqs-ex.log
arguments = 5
Requirements = Memory >= 256 && Disk > 10000
Rank = (KFLOPS*10000) + Memory
should_transfer_files = NO
notification = NEVER
queue
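The effect of Requirements and Rank can be illustrated by evaluating the two expressions from the example against a few machine descriptions. This sketch only mimics the idea of matchmaking; the real evaluation is done by Condor on ClassAds. The machine names and attribute values below are invented for illustration.

```python
# Hypothetical machine descriptions (all values invented for this sketch).
MACHINES = [
    {"Name": "nodeA", "Memory": 252, "Disk": 500000, "KFLOPS": 400000},
    {"Name": "nodeB", "Memory": 512, "Disk": 800000, "KFLOPS": 300000},
    {"Name": "nodeC", "Memory": 1024, "Disk": 5000, "KFLOPS": 900000},
    {"Name": "nodeD", "Memory": 2048, "Disk": 900000, "KFLOPS": 350000},
]


def requirements(machine):
    """Requirements = Memory >= 256 && Disk > 10000"""
    return machine["Memory"] >= 256 and machine["Disk"] > 10000


def rank(machine):
    """Rank = (KFLOPS*10000) + Memory"""
    return machine["KFLOPS"] * 10000 + machine["Memory"]


def best_match(machines):
    """Filter by Requirements, then pick the machine with the highest Rank."""
    candidates = [m for m in machines if requirements(m)]
    return max(candidates, key=rank, default=None)


if __name__ == "__main__":
    for m in MACHINES:
        print(m["Name"], "eligible" if requirements(m) else "rejected", rank(m))
    print("best match:", best_match(MACHINES)["Name"])
```

With these invented values, nodeA fails the memory requirement and nodeC the disk requirement, so neither is ranked at all even though nodeC has the highest KFLOPS; among the remaining machines the KFLOPS term dominates the Rank.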

Page 24: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Submitting Clusters of Jobs

# Example condor_submit input file that defines
# a cluster of 600 jobs with different directories
Universe = vanilla
Executable = my_job
Log = my_job.log
Arguments = -arg1 -arg2
Input = my_job.stdin
Output = my_job.stdout
Error = my_job.stderr
InitialDir = run_$(Process)
Queue 600
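With `InitialDir = run_$(Process)`, each of the 600 jobs runs in its own directory, run_0 to run_599, and reads its own my_job.stdin from there. These directories have to be prepared before submission. A minimal sketch of that preparation step follows; the function name and the content written to the input files are placeholders.

```python
import tempfile
from pathlib import Path


def prepare_run_dirs(base, count, stdin_text=""):
    """Create run_0 .. run_<count-1> under base, each with a my_job.stdin file."""
    created = []
    for process in range(count):  # $(Process) counts from 0
        run_dir = Path(base) / f"run_{process}"
        run_dir.mkdir(parents=True, exist_ok=True)
        (run_dir / "my_job.stdin").write_text(stdin_text)
        created.append(run_dir)
    return created


if __name__ == "__main__":
    # Demonstrate in a temporary directory with 3 instead of 600 jobs.
    with tempfile.TemporaryDirectory() as base:
        for run_dir in prepare_run_dirs(base, 3, "placeholder input\n"):
            print(run_dir.name, sorted(p.name for p in run_dir.iterdir()))
```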

Page 25: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

File Transferring

universe = vanilla
executable = /home/condor/bin/process-files.py
output = /home/condor/data/out/transferdata1.out
error = /home/condor/data/err/transferdata1.err
log = /home/condor/data/log/transferdata1.log
arguments = input1.in input2.in input3.in
requirements =
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = input1.in,input2.in,input3.in
transfer_output_files = input1.out,input2.out,input3.out
notification = NEVER
Queue

Page 26: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

DAGs

• Directed Acyclic Graph (DAG).
• Represents a set of jobs with mutual dependencies.
• Corresponds to the "Cascade" in the 'DFS world'.
• Requires a DAG submission file, which references the job submission files.
• Submitted with condor_submit_dag.
• Controlled by the DAGMan utility, which runs as a normal Condor job.
• Possible to make DAGs of DAGs.

Page 27: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Example simple DAG

Job Job-1-1 /home/condor/tmp/dag-ex1/Job-1-1.cmd
Job Job-2-1 /home/condor/tmp/dag-ex1/Job-2-1.cmd
Job Job-2-2 /home/condor/tmp/dag-ex1/Job-2-2.cmd
Job Job-3-1 /home/condor/tmp/dag-ex1/Job-3-1.cmd

PARENT Job-1-1 CHILD Job-2-1
PARENT Job-1-1 CHILD Job-2-2
PARENT Job-2-1 Job-2-2 CHILD Job-3-1

DOT dag-ex1.dot DONT-OVERWRITE UPDATE

[Diagram: 1-1 at the top; 2-1 and 2-2 in the middle; 3-1 at the bottom.]
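The PARENT/CHILD lines fix the order in which the jobs may run. The sketch below reads the example DAG and derives one valid execution order with the standard-library `graphlib` module (Python 3.9 or later). It illustrates the dependency structure only and is not how DAGMan is implemented; the parser handles just the `Job` and `PARENT ... CHILD ...` lines shown above.

```python
from graphlib import TopologicalSorter

# The example DAG from above.
DAG_TEXT = """\
Job Job-1-1 /home/condor/tmp/dag-ex1/Job-1-1.cmd
Job Job-2-1 /home/condor/tmp/dag-ex1/Job-2-1.cmd
Job Job-2-2 /home/condor/tmp/dag-ex1/Job-2-2.cmd
Job Job-3-1 /home/condor/tmp/dag-ex1/Job-3-1.cmd
PARENT Job-1-1 CHILD Job-2-1
PARENT Job-1-1 CHILD Job-2-2
PARENT Job-2-1 Job-2-2 CHILD Job-3-1
"""


def parse_dependencies(text):
    """Return {job: set of jobs that must finish first}."""
    deps = {}
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        keyword = words[0].upper()
        if keyword == "JOB":
            deps.setdefault(words[1], set())
        elif keyword == "PARENT":
            split = [w.upper() for w in words].index("CHILD")
            parents, children = words[1:split], words[split + 1:]
            for parent in parents:
                deps.setdefault(parent, set())
            for child in children:
                deps.setdefault(child, set()).update(parents)
    return deps


def execution_order(text):
    """One order in which every job runs after all of its parents."""
    return list(TopologicalSorter(parse_dependencies(text)).static_order())


if __name__ == "__main__":
    print(execution_order(DAG_TEXT))
```

Job-2-1 and Job-2-2 depend only on Job-1-1, so they may run in parallel; Job-3-1 has to wait for both.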

Page 28: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

DAG Execution Status

$ condor_submit_dag dag-ex1.dag

-----------------------------------------------------------------------
File for submitting this DAG to Condor          : dag-ex1.dag.condor.sub
Log of DAGMan debugging messages                : dag-ex1.dag.dagman.out
Log of Condor library debug messages            : dag-ex1.dag.lib.out
Log of the life of condor_dagman itself         : dag-ex1.dag.dagman.log

Condor Log file for all Condor jobs of this DAG: dag-ex1.dag.dummy_log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 11849.
-----------------------------------------------------------------------

Page 29: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

DAG Visualization

• Possible to visualize the DAG.
• The DAGMan process produces snapshot files showing the status of the DAG execution.
• These can be processed with the Graphviz package.
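To give an idea of what Graphviz consumes, the sketch below writes a small graph description in the DOT language for the example DAG, colouring nodes by status. It is not the exact file that DAGMan produces: the status names and the colour scheme are assumptions made for this illustration.

```python
# Dependencies of the example DAG as (parent, child) pairs.
EDGES = [
    ("Job-1-1", "Job-2-1"),
    ("Job-1-1", "Job-2-2"),
    ("Job-2-1", "Job-3-1"),
    ("Job-2-2", "Job-3-1"),
]

# Hypothetical status names and colours.
STATUS_COLORS = {"done": "green", "running": "yellow", "idle": "white"}


def to_dot(edges, status=None):
    """Render the DAG as DOT text; nodes without a status count as idle."""
    status = status or {}
    nodes = sorted({node for edge in edges for node in edge})
    lines = ["digraph dag {"]
    for node in nodes:
        color = STATUS_COLORS.get(status.get(node, "idle"), "white")
        lines.append(f'    "{node}" [style=filled, fillcolor={color}];')
    for parent, child in edges:
        lines.append(f'    "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    snapshot = {"Job-1-1": "done", "Job-2-1": "running", "Job-2-2": "running"}
    print(to_dot(EDGES, snapshot), end="")
```

The resulting text can be saved to a file and rendered with the Graphviz `dot` program.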

Page 30: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Scheduling Visualization

• Example of monitoring a running DFO/QC DAG/Cascade:

Page 31: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

Pipeline Cascade

• A science pipeline cascade may look like this:

[Diagram: Preproc; Recipe A/1, A/2, …, A/N; Recipe B/1, B/2, …, B/N; Recipe C/1, C/2, …, C/N; Postproc.]

Page 32: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

DFS Condor Integration Activities

• In the process of defining an environment for the Condor/BQS for DFO/QC and Paranal.
• A few tools implemented to facilitate the interaction with Condor.
• Will purchase blade systems for DFO/QC and for Paranal (+ file servers based on a fiber channel network).
• At Paranal, Condor might be controlled directly from the Data Organizer (new implementation).
• Will use a shared file system for all nodes in the cluster (RedHat Global File System).
• Blade systems at HQ will be closely integrated with the archive for fast file access (Fast Cache Archive).

Page 33: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

DFO/QC Condor Pool(s)

[Diagram showing: Dedicated Pool (Blade Systems) running Condor; GFS (RedHat Global File System, http://www.redhat.com/software/rha/gfs); Submit Node (3); Personal Condor (5).]

Page 34: Introduction to Condor DMD/DFS J.Knudstrup December 2005.

More Info

• Condor WEB site:

http://www.cs.wisc.edu/condor

" ... Since the early days of mankind the primary motivation for the establishment of communities has been the idea that by being part of an organized group the capabilities of an individual are improved. The great progress in the area of inter-computer communication led to the development of means by which stand-alone processing sub-systems can be integrated into multi-computer 'communities'. ... "

Miron Livny (Creator of Condor)