NMRbox summer workshop June 26-29, 2017

Introduction to High Throughput Computing with HTCondor

Jonathan Wedell

What is HTCondor?

• “Specialized workload management system for compute-intensive jobs.”

Provides:

  • Job queueing, scheduling, and querying

  • Resource management

  • Priority management

• Other workload management systems

  • SLURM

  • Ganglia

  • Oracle/Sun Grid Engine

History

• Developed at University of Wisconsin department of computer science since 1988

• Originally called “Condor”

  • Name changed to avoid lawsuit

• Condor is now High Throughput Condor (HTCondor)

  • The HT is silent

HTC vs HPC

• HTCondor is high throughput computing, not high performance computing

• When your work is “embarrassingly parallel” think HTCondor

High Performance Computing

High Throughput Computing

Why use HTCondor?

• Access to additional computing resources

• Job scheduling and management

• Reliability

  • Automatically re-running failed jobs is possible

• Automating workflows

  • DAGMan

Condor Pool

• HTCondor allows creating pools of different machines

• Submitted jobs can run on any machine in the pool that meets the requirements

• For the workshop, the pool consists of a single machine

  • 30 execute slots

Job Scheduling

• When should the job run?

• How many times should it run?

• What resources are needed?

  • Memory, disk space, CPU cores

• When should it run on which machines?

  • Desktops at night

  • Servers during the day

Reliability

• Jobs can automatically restart upon failure

• Logs automatically capture when jobs ran and any output and errors generated

• Checkpointing allows a job to save its state and move from one machine to another, or resume from the last checkpoint upon failure

  • Requires special linking
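
One common way to get the automatic restart mentioned above is an `on_exit_remove` policy in the submit file: the job leaves the queue only after a normal exit with status 0, and otherwise returns to the idle state to be run again. The Python sketch below adds such a line to an existing submit description. The helper name and the sample file contents are illustrative assumptions, not part of the workshop material.

```python
def add_rerun_on_failure(submit_text):
    """Insert an on_exit_remove policy before the first queue statement."""
    # The job is removed from the queue only if it exited normally with
    # status 0; any other exit sends it back to idle to be rerun.
    policy = "on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)"
    out = []
    inserted = False
    for line in submit_text.splitlines():
        if not inserted and line.strip().lower().startswith("queue"):
            out.append(policy)
            inserted = True
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical submit description used only for illustration.
base = "executable = analyze.sh\nlog = job.log\nqueue\n"
print(add_rerun_on_failure(base))
```

A job that always fails would be rerun indefinitely under this policy, so in practice it is usually paired with some limit on the number of attempts.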

DAGMan

• Directed Acyclic Graph Manager

• Allows creating a hierarchy of jobs and establishing dependencies between them

• Explained in detail in Kumaran’s slides

Preparing to use HTCondor

• Executable must be able to run non-interactively

• Works best with self-contained binary/script

  • Possible to use for more complex packages, but some restrictions apply

• Optionally relink binary with HTCondor

  • Allows saving state and forwarding system calls to the local machine

Submitting to HTCondor

• The submit file is the most basic description of a job to run using HTCondor. Specify:

  • Universe

  • Requirements

  • Log locations

  • Arguments
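
As a rough illustration of the items listed above, the following Python sketch prints a minimal submit description. The file names (`analyze.sh`, `input.dat`, `job.log`, and so on) and the resource figures are placeholders rather than values from the workshop. The submit commands themselves (`universe`, `executable`, `arguments`, `log`, `output`, `error`, `request_cpus`, `request_memory`, `queue`) are standard HTCondor ones.

```python
def render_submit(executable, arguments, universe="vanilla", cpus=1, memory_mb=1024):
    """Return the text of a minimal HTCondor submit description."""
    lines = [
        f"universe       = {universe}",
        f"executable     = {executable}",
        f"arguments      = {arguments}",
        # log records job events; output and error capture stdout and stderr.
        "log            = job.log",
        "output         = job.out",
        "error          = job.err",
        f"request_cpus   = {cpus}",
        # request_memory is interpreted in MB when no unit is given.
        f"request_memory = {memory_mb}",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Placeholder executable and argument, for illustration only.
print(render_submit("analyze.sh", "input.dat"))
```

Saving the printed text as, say, `job.sub` and running `condor_submit job.sub` would hand the job to the scheduler.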

Universes

• Suggested universes:

  • Standard

  • Vanilla

  • Java

• Other universes:

  • Parallel

  • Scheduler

  • Grid

  • VM

Requirements

• CPU, memory, disk availability

• Network filesystems

• Other arbitrary parameters

  • “This job can only run on machine xyz.”

  • “This job requires Matlab version x.”
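
Constraints like the two quoted above are written as a single `requirements` expression in the submit file, using ClassAd attributes that the execute machines advertise. The sketch below assembles such an expression in Python. `Machine` and `Memory` are standard machine attributes; `HasMatlab` is a hypothetical custom attribute that a pool administrator would have to define, and the host name is a placeholder.

```python
def requirements_expr(machine=None, min_memory_mb=None, custom_attrs=()):
    """Join individual constraints into one ClassAd requirements expression."""
    clauses = []
    if machine is not None:
        # Machine holds the fully qualified host name of the execute node.
        clauses.append(f'(Machine == "{machine}")')
    if min_memory_mb is not None:
        clauses.append(f"(Memory >= {min_memory_mb})")
    for attr in custom_attrs:
        # =?= is the ClassAd "is identical to" operator; it evaluates to
        # False rather than UNDEFINED on machines that lack the attribute.
        clauses.append(f"({attr} =?= True)")
    return " && ".join(clauses)


# "This job can only run on machine xyz."
print("requirements = " + requirements_expr(machine="xyz.example.org"))
# "This job requires Matlab", assuming the pool advertises HasMatlab.
print("requirements = " + requirements_expr(min_memory_mb=2048, custom_attrs=["HasMatlab"]))
```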

Other options

• Where to save log files

• What arguments to run software with

• How many times to run

  • Potentially with different arguments

• Under what conditions to stop or pause execution

• What files to transfer to the execute machine

  • Supports downloading files using HTTP
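
Several of these options can be combined in one submit file. The Python sketch below prints a description that queues the same executable several times, gives each run its own argument through the built-in `$(Process)` macro, and lists input files to transfer, one of them an HTTP URL. The executable name, file names, and URL are placeholders chosen for illustration.

```python
def render_sweep(executable, n_runs, input_files):
    """Submit description that queues n_runs jobs, each with its own index."""
    return "\n".join([
        "universe                = vanilla",
        f"executable              = {executable}",
        # $(Process) expands to 0, 1, ..., n_runs - 1, one value per job.
        "arguments               = $(Process)",
        "output                  = run_$(Process).out",
        "error                   = run_$(Process).err",
        "log                     = sweep.log",
        "should_transfer_files   = YES",
        "when_to_transfer_output = ON_EXIT",
        # Entries may be local paths or URLs, separated by commas.
        "transfer_input_files    = " + ", ".join(input_files),
        f"queue {n_runs}",
    ]) + "\n"


print(render_sweep("simulate.sh", 5,
                   ["params.txt", "http://example.org/data/reference.dat"]))
```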

CS-Rosetta Example

[Workflow diagram: Preparation, then six Structure Calculation jobs, then Cleanup]
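
A workflow of this shape maps naturally onto a DAGMan input file. The Python sketch below prints one, assuming the structure calculations are independent of each other and can run in parallel between the preparation and cleanup steps. The node names and the submit file names (`prep.sub`, `calc.sub`, `cleanup.sub`) are hypothetical, and this is not necessarily how the workshop's CS-Rosetta run was set up.

```python
def render_dag(n_calcs):
    """DAG text: one preparation job, n_calcs calculations, one cleanup job."""
    calc_names = [f"calc{i}" for i in range(n_calcs)]
    # Each JOB line names a node and the submit file that describes it.
    lines = ["JOB prep prep.sub"]
    lines += [f"JOB {name} calc.sub" for name in calc_names]
    lines.append("JOB cleanup cleanup.sub")
    # Every calculation waits for prep; cleanup waits for every calculation.
    lines.append("PARENT prep CHILD " + " ".join(calc_names))
    lines.append("PARENT " + " ".join(calc_names) + " CHILD cleanup")
    return "\n".join(lines) + "\n"


print(render_dag(6))
```

Saved as, for example, `workflow.dag`, the file would be submitted with `condor_submit_dag workflow.dag`.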

Interacting with HTCondor

• condor_status

  • Shows available cores

• condor_q

  • Shows jobs in the queue

• condor_rm job_id|user_id

  • Remove jobs by ID or user
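
These commands are normally typed at a shell prompt, but they can also be driven from a script. The Python sketch below builds the argument lists and runs a command only when the tool is actually installed. The helper names are illustrative, and the job ID `42` is a made-up value that is printed but never executed.

```python
import shutil
import subprocess


def condor_command(tool, *args):
    """Build the argument list for an HTCondor command-line tool."""
    return [tool, *[str(a) for a in args]]


def run_if_available(argv):
    """Run the command and return its stdout, or None if the tool is missing."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, capture_output=True, text=True).stdout


if __name__ == "__main__":
    # Query the pool and the queue; these calls do not modify anything.
    for argv in (condor_command("condor_status"), condor_command("condor_q")):
        print("$ " + " ".join(argv))
        print(run_if_available(argv))
    # Removal is shown but not executed; 42 is a hypothetical job ID.
    print("$ " + " ".join(condor_command("condor_rm", 42)))
```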

Live Demo