NMRbox summer workshop June 26-29, 2017
Introduction to High Throughput Computing with HTCondor
Jonathan Wedell
What is HTCondor?
• “Specialized workload management system for compute-intensive jobs.”
Provides:
• Job queueing, scheduling, and querying
• Resource management
• Priority management
• Other workload management systems
  • SLURM
  • Ganglia
  • Oracle/Sun Grid Engine
History
• Developed at the University of Wisconsin-Madison Department of Computer Sciences since 1988
• Originally called “Condor”
• Name changed in 2012 to settle a trademark lawsuit
• Condor is now High Throughput Condor (HTCondor)
• The HT is silent
HTC vs HPC
• HTCondor is high throughput computing, not high performance computing
• When your work is “embarrassingly parallel” (many independent tasks that never need to communicate with each other), think HTCondor
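As a rough illustration of what "embarrassingly parallel" means, the sketch below splits a parameter sweep into tasks that share no state, so each one could become a separate HTCondor job. The function names and the seed/temperature parameters are hypothetical and exist only for this example.

```python
from itertools import product


def independent_tasks(seeds, temperatures):
    """Return one argument tuple per task; no task depends on another."""
    return [(seed, temp) for seed, temp in product(seeds, temperatures)]


def run_task(seed, temperature):
    """Stand-in for real work: the result depends only on its own inputs."""
    return seed * temperature


tasks = independent_tasks(seeds=[1, 2, 3], temperatures=[280, 300])

# The tasks can run in any order, on any machine, and give the same results.
results = {task: run_task(*task) for task in tasks}
```

Because no task reads another task's output, the only coordination needed is handing out arguments and collecting results, which is the kind of bookkeeping a workload manager takes over.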
Why use HTCondor?
• Access to additional computing resources
• Job scheduling and management
• Reliability
  • Automatically re-running failed jobs is possible
• Automating workflows
  • DAGMan
Condor Pool
• HTCondor allows creating pools of different machines
• Submitted jobs can run on any machine in the pool that meets the requirements
• For the workshop, the pool consists of only one machine
  • 30 execute slots
Job Scheduling
• When should the job run?
• How many times should it run?
• What resources are needed?
  • Memory, disk, CPU cores
• Which machines should it run on, and when?
  • Desktops at night
  • Servers during the day
Reliability
• Jobs can automatically restart upon failure
• Logs automatically capture when jobs ran and any output and errors generated
• Checkpointing allows a job to save its state and either move from one machine to another or resume from the last checkpoint upon failure
  • Requires special linking
DAGMan
• Directed Acyclic Graph Manager
• Allows creating a hierarchy of jobs and establishing dependencies between them
• Explained in detail in Kumaran’s slides
Preparing to use HTCondor
• Executable must be able to run non-interactively
• Works best with a self-contained binary or script
  • Possible to use for more complex packages, but some restrictions apply
• Optionally relink the binary with HTCondor
  • Allows saving state and forwarding system calls to the submit machine
Submitting to HTCondor
• The submit file is the most basic description of a job to run using HTCondor. Specify:
  • Universe
  • Requirements
  • Log locations
  • Arguments
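To make the four items above concrete, here is a minimal sketch that writes out a submit file as text. The executable name `./analyze.sh`, its argument, and the `job.*` file names are hypothetical; the commands themselves (`universe`, `executable`, `arguments`, `request_cpus`, `request_memory`, `log`, `output`, `error`, `queue`) are standard submit-file commands.

```python
def make_submit_file(executable, arguments, universe="vanilla"):
    """Return the text of a basic HTCondor submit file."""
    lines = [
        f"universe = {universe}",
        f"executable = {executable}",
        f"arguments = {arguments}",
        # Resource requirements the matching machine must satisfy.
        "request_cpus = 1",
        "request_memory = 1GB",
        # Log locations: event log, standard output, standard error.
        "log = job.log",
        "output = job.out",
        "error = job.err",
        # Place one instance of the job in the queue.
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical script and input file, for illustration only.
submit_text = make_submit_file("./analyze.sh", "input.dat")
print(submit_text)
```

Saved to a file such as `job.sub`, this text would be handed to the scheduler with `condor_submit job.sub`.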
Universes
• Suggested universes:
  • Standard
  • Vanilla
  • Java
• Other universes:
  • Parallel
  • Scheduler
  • Grid
  • VM
Requirements
• CPU, memory, disk availability
• Network filesystems
• Other arbitrary parameters
  • “This job can only run on machine xyz.”
  • “This job requires Matlab version x.”
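Constraints like the two quoted above are written as a `requirements` expression that is evaluated against each machine's advertised attributes. The sketch below assembles such an expression; `Memory` and `Machine` are standard machine attributes, whereas `MatlabVersion` is a hypothetical custom attribute that a pool administrator would have to advertise.

```python
def requirements_expression(min_memory_mb=None, machine=None, custom=None):
    """Build a submit-file `requirements` line from optional constraints."""
    clauses = []
    if min_memory_mb is not None:
        # Memory is advertised by machines in MiB.
        clauses.append(f"(Memory >= {min_memory_mb})")
    if machine is not None:
        # Pin the job to one named machine.
        clauses.append(f'(Machine == "{machine}")')
    for attribute, value in (custom or {}).items():
        # Custom attributes only match if the pool advertises them.
        clauses.append(f'({attribute} == "{value}")')
    if not clauses:
        clauses.append("True")  # no constraints: any machine is acceptable
    return "requirements = " + " && ".join(clauses)


# Hypothetical attribute name and value, for illustration only.
print(requirements_expression(2048, custom={"MatlabVersion": "R2017a"}))
```

Every clause is joined with `&&`, so a machine must satisfy all of them before the job can be matched to it.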
Other options
• Where to save log files
• What arguments to run software with
• How many times to run
  • Potentially with different arguments
• Under what conditions to stop or pause execution
• What files to transfer to the execute machine
  • Supports downloading files using HTTP
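Several of these options can be combined in one submit file. The sketch below is one possible arrangement, assuming a hypothetical executable that accepts a `--run` flag and a hypothetical input URL; the 24-hour limit is likewise an arbitrary choice for the example.

```python
def sweep_submit_file(executable, n_runs, input_url):
    """Return submit-file text that queues n_runs jobs with different arguments."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        # $(Process) expands to 0, 1, ..., n_runs - 1, one value per job.
        "arguments = --run $(Process)",
        # Copy files to the execute machine and return output when the job exits.
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        # An http:// URL here is downloaded on the execute machine.
        f"transfer_input_files = {input_url}",
        "log = sweep.log",
        "output = sweep.$(Process).out",
        "error = sweep.$(Process).err",
        # Stop condition: remove a job that has been running for over 24 hours.
        "periodic_remove = (JobStatus == 2) && ((time() - EnteredCurrentStatus) > 86400)",
        f"queue {n_runs}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical script name and URL, for illustration only.
print(sweep_submit_file("./fold.sh", 6, "http://example.org/input.tar.gz"))
```

Using `$(Process)` in the output and error file names keeps the runs from overwriting each other's results.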
CS-Rosetta Example
[Diagram: Preparation, then six Structure Calculation jobs, then Cleanup]
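A workflow of this shape could be described to DAGMan roughly as follows. This is a sketch that assumes the structure calculations are independent of one another and all run between the preparation and cleanup steps; the node names and the `prep.sub`, `calc.sub`, and `cleanup.sub` submit files are hypothetical.

```python
def rosetta_dag(n_calculations=6):
    """Return DAG-file text: prep -> N structure calculations -> cleanup."""
    calc_names = [f"calc{i}" for i in range(n_calculations)]
    # One JOB line per node, naming the submit file that describes it.
    lines = ["JOB prep prep.sub"]
    lines += [f"JOB {name} calc.sub" for name in calc_names]
    lines.append("JOB cleanup cleanup.sub")
    # Every calculation waits for prep to finish.
    lines.append("PARENT prep CHILD " + " ".join(calc_names))
    # Cleanup waits for every calculation to finish.
    lines.append("PARENT " + " ".join(calc_names) + " CHILD cleanup")
    return "\n".join(lines) + "\n"


print(rosetta_dag())
```

Written to a file such as `rosetta.dag`, the workflow would be started with `condor_submit_dag rosetta.dag`.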
Interacting with HTCondor
• condor_status
  • Shows the slots (cores) in the pool and their state
• condor_q
  • Shows jobs in the queue
• condor_rm job_id|user_id
  • Remove jobs by ID or user
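These commands are normally typed at a shell prompt, but they can also be driven from a script. The sketch below only builds the argument lists and runs them when the tool is actually installed; the helper names and the job ID `42.0` are hypothetical.

```python
import shutil
import subprocess


def condor_argv(tool, *args):
    """Build the argument list for an HTCondor command-line tool."""
    return [tool, *map(str, args)]


def run_if_installed(argv):
    """Run the command and return its stdout, or None if the tool is missing."""
    if shutil.which(argv[0]) is None:
        return None
    completed = subprocess.run(argv, capture_output=True, text=True, check=False)
    return completed.stdout


status_cmd = condor_argv("condor_status")        # slots in the pool
queue_cmd = condor_argv("condor_q")              # jobs in the queue
remove_cmd = condor_argv("condor_rm", "42.0")    # hypothetical job ID

# Only the read-only query is executed here; removal is left to the user.
print(run_if_installed(queue_cmd))
```

On a machine without HTCondor the last line simply prints `None`, so the sketch is safe to try outside a pool.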