The Campus Cluster

What is the Campus Cluster?

• Batch job system
• High throughput
• High latency
• Available resources:
– ~450 nodes
– 12 cores/node
– 24-96 GB memory
– Shared high-performance filesystem
– High-speed multi-node message passing


What isn’t the Campus Cluster?

• Not: Instantly available computation resource
– Can wait up to 4 hours for a node

• Not: High I/O friendly
– Network disk access can hurt performance

• Not: ….


Getting Set Up


Getting started

• Request an account: https://campuscluster.illinois.edu/invest/user_form.html

• Connecting: ssh to taub.campuscluster.illinois.edu
– Use your NetID and AD password
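A minimal login sketch from your own machine, assuming your NetID is also the remote username:

ssh netid@taub.campuscluster.illinois.edu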


Where to put data

• Home directory ~/
– Backed up, currently no quota (in future, 10's of GB)

• Use /scratch for temporary data - ~10TB
– Scratch data is currently deleted after ~3 months
– Available on all nodes
– No backup

• /scratch.local - ~100GB
– Local to each node, not shared across the network
– Beware that other users may fill the disk

• /projects/VisionLanguage/ - ~15TB
– Keep things tidy by creating a directory for your netid
– Backed up

• Current filesystem best practices (should improve for Cluster v. 2):
– Try to do batch writes to one large file
– Avoid many little writes to many little files
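For example, a quick sketch of the one-directory-per-netid convention (the layout under /scratch is an assumption; only /projects/VisionLanguage/ is prescribed above):

[iendres2 ~]$ mkdir -p /projects/VisionLanguage/$USER
[iendres2 ~]$ mkdir -p /scratch/$USER/tmp_results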


Backup = Snapshots (just learned this yesterday)

• Snapshots taken daily

• Not intended for disaster recovery – Stored on same disk as data

• Intended for accidental deletes/overwrites, etc.
– Backed-up data can be accessed at: /gpfs/ddn_snapshot/.snapshots/<date>/<path>

e.g. recover accidentally deleted file in home directory: /gpfs/ddn_snapshot/.snapshots/2012-12-24/home/iendres2/christmas_list
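Concretely, restoring that file is just a copy out of the snapshot tree (the date and path are the ones from the example above):

[iendres2 ~]$ cp /gpfs/ddn_snapshot/.snapshots/2012-12-24/home/iendres2/christmas_list ~/christmas_list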


Moving data to/from cluster

• Only option right now is sftp/scp

• SSHFS lets you mount a directory from remote machines
– Haven't tried this, but might be useful
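A rough sketch of both options, run from your local machine (netid and paths are placeholders; the sshfs mount is untested here, as noted):

scp -r netid@taub.campuscluster.illinois.edu:~/workdir/results ./results    # pull results off the cluster
mkdir -p ~/cluster_home
sshfs netid@taub.campuscluster.illinois.edu: ~/cluster_home                 # mount the remote home directory locally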


Modules

[iendres2 ~]$ module load <modulename>

Manages the environment, typically used to add software to the path:
– To get the latest version of matlab:

[iendres2 ~]$ module load matlab/7.14

– To find modules such as vim, svn:

[iendres2 ~]$ module avail


Useful Startup Options

Appended to the end of my .bashrc:
– Make default permissions the same for user and group, useful when working on a joint project
• umask u=rwx,g=rwx
– Safer alternative (don't allow group writes)
• umask u=rwx,g=rx
– Load common modules
• module load vim
• module load svn
• module load matlab
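Put together, the tail of a ~/.bashrc might look like this (mirrors the options above; pick whichever umask you prefer):

umask u=rwx,g=rx     # group members can read/execute but not modify
module load vim
module load svn
module load matlab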


Submitting Jobs


Queues

– Primary (VisionLanguage)
• Nodes we own (currently 8)
• Jobs can last 72 hours
• We have priority access

– Secondary (secondary)
• Anyone else's idle nodes (~500)
• Jobs can only last 4 hours; they are automatically killed after that
• Not unusual to wait 12 hours for a job to begin running


Scheduler

• Typically behaves as first come, first served

• Priority scheduling is claimed, but we don't know how it works…


Types of job

– Batch jobs
• No graphics; runs and completes without user interaction

– Interactive jobs
• Brings a remote shell to your terminal
• X-forwarding available for graphics

• Both wait in queue the same way


Scheduling jobs

– Batch jobs
• [iendres2 ~]$ qsub <job_script>
• job_script defines the parameters of the job and the actual command to run
• Details on job scripts to follow

– Interactive jobs
• [iendres2 ~]$ qsub -q <queuename> -I -l walltime=00:30:00,nodes=1:ppn=12
• Include -X for X-forwarding
• Details on the -l parameters to follow
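For example, a 30-minute interactive session with X-forwarding on the secondary queue:

[iendres2 ~]$ qsub -q secondary -I -X -l walltime=00:30:00,nodes=1:ppn=12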


Configuring Jobs


Basics

• Parameters of jobs are defined by a bash script which contains "PBS commands" followed by the script to execute:

#PBS -q VisionLanguage
#PBS -l nodes=1:ppn=12
#PBS -l walltime=04:00:00
…
cd ~/workdir/
echo "This is job number ${PBS_JOBID}"


• Queue to use – VisionLanguage or secondary


• Number of nodes – 1, unless using MPI or other distributed programming

• Processors per node – Always 12, smallest computation unit is a physical node, which has 12 cores (with current hardware)*

*Some queues are configured to allow multiple concurrent jobs per node, but this is uncommon


• Maximum time the job will run for – it is killed if it exceeds this
– 72:00:00 for the primary queue
– 04:00:00 for the secondary queue


• Bash commands are allowed anywhere in the script and will be executed on the scheduled worker node after all PBS commands are handled


There are some reserved variables that the scheduler will fill in once the job is scheduled (see `man qsub` for more variables)


Scheduler variables (from the manpage)

PBS_O_HOST the name of the host upon which the qsub command is running.

PBS_SERVER the hostname of the pbs_server which qsub submits the job to.

PBS_O_QUEUE the name of the original queue to which the job was submitted.

PBS_O_WORKDIR the absolute path of the current working directory of the qsub command.

PBS_ARRAYID each member of a job array is assigned a unique identifier (see -t)

PBS_ENVIRONMENT set to PBS_BATCH to indicate the job is a batch job, or to PBS_INTERACTIVE to indicate the job is a PBS interactive job, see -I option.

PBS_JOBID the job identifier assigned to the job by the batch system.

PBS_JOBNAME the job name supplied by the user.

PBS_NODEFILE the name of the file containing the list of nodes assigned to the job (for parallel and cluster systems).

PBS_QUEUE the name of the queue from which the job is executed.
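For example, a small job script that just prints a few of these (the queue and walltime here are placeholders):

#PBS -q secondary
#PBS -l nodes=1:ppn=12
#PBS -l walltime=00:10:00
echo "Submitted from host:      ${PBS_O_HOST}"
echo "Directory at submit time: ${PBS_O_WORKDIR}"
echo "Job id / job name:        ${PBS_JOBID} / ${PBS_JOBNAME}"
cd ${PBS_O_WORKDIR}   # common idiom: run from wherever qsub was invoked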



Monitoring Jobs

[iendres2 ~]$ qstat

Sample output:

JOBID             JOBNAME           USER       WALLTIME  STATE  QUEUE
333885[].taubm1   r-afm-average     hzheng8    0         Q      secondary
333899.taubm1     test6             lee263     03:33:33  R      secondary
333900.taubm1     cgfb-a            dcyang2    09:22:44  R      secondary
333901.taubm1     cgfb-b            dcyang2    09:31:14  R      secondary
333902.taubm1     cgfb-c            dcyang2    09:28:28  R      secondary
333903.taubm1     cgfb-d            dcyang2    09:12:44  R      secondary
333904.taubm1     cgfb-e            dcyang2    09:27:45  R      secondary
333905.taubm1     cgfb-f            dcyang2    09:30:55  R      secondary
333906.taubm1     cgfb-g            dcyang2    09:06:51  R      secondary
333907.taubm1     cgfb-h            dcyang2    09:01:07  R      secondary
333908.taubm1     ...conp5_38.namd  harpole2   0         H      cse
333914.taubm1     ktao3.kpt.12      chandini   03:05:36  C      secondary
333915.taubm1     ktao3.kpt.14      chandini   03:32:26  R      secondary
333916.taubm1     joblammps         daoud2     03:57:06  R      cse

States:
Q – Queued, waiting to run
R – Running
H – Held, by user or admin; won't run until released (see qhold, qrls)
C – Closed, finished running
E – Error; this usually doesn't happen and indicates a problem with the cluster

grep is your friend for finding specific jobs
(e.g. qstat -u iendres2 | grep " R " gives all of my running jobs)
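A few more qstat variations that pair well with this (the username and job id come from the sample output above):

qstat -u iendres2     # only my jobs
qstat -q              # per-queue summary
qstat -f 333899       # full details for a single job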


Managing Jobs

qalter, qdel, qhold, qmove, qmsg, qrerun, qrls, qselect, qsig, qstat

Each takes a jobid + some arguments
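For example, typical usage for a few of these (the job ids are hypothetical):

qdel 333899                            # remove a queued or running job
qhold 333900                           # hold a queued job
qrls 333900                            # release it again
qalter -l walltime=02:00:00 333901     # change a queued job's requested walltime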


Problem: I want to run the same job with multiple parameters

#PBS -q VisionLanguage
#PBS -l nodes=1:ppn=12
#PBS -l walltime=04:00:00

cd ~/workdir/
./script <param1> <param2>

Solution: Create wrapper script to iterate over params

Where:
param1 = {a, b, c}
param2 = {1, 2, 3}


Problem 2: I can’t pass parameters into my job script

#PBS -q VisionLanguage
#PBS -l nodes=1:ppn=12
#PBS -l walltime=04:00:00

cd ~/workdir/
./script <param1> <param2>

Solution 2: Hack it!

Where:
param1 = {a, b, c}
param2 = {1, 2, 3}


We can pass parameters via the jobname and delimit them using the '-' character (or whatever you want)

#PBS -q VisionLanguage
#PBS -l nodes=1:ppn=12
#PBS -l walltime=04:00:00

# Pass parameters via jobname:
export IFS="-"
i=1
for word in ${PBS_JOBNAME}; do
  echo $word
  arr[i]=$word
  ((i++))
done

# Stuff to execute
echo Jobname: ${arr[1]}
cd ~/workdir/
echo ${arr[2]} ${arr[3]}


qsub -N job-param1-param2 job_script

qsub’s -N parameter sets the job name



Output would be:

Jobname: job
param1 param2


Problem: I want to run the same job with multiple parameters


Where:
param1 = {a, b, c}
param2 = {1, 2, 3}

#!/bin/bash

param1=({a,b,c})
param2=({1,2,3})   # or {1..3}
for p1 in ${param1[@]}; do
  for p2 in ${param2[@]}; do
    qsub -N job-${p1}-${p2} job_script
  done
done

Now Loop!


Problem 3: My job isn’t multithreaded, but needs to run many times

#PBS -q VisionLanguage
#PBS -l nodes=1:ppn=12
#PBS -l walltime=04:00:00

cd ~/workdir/
./script ${idx}

Solution: Run 12 independent processes on the same node so 11 CPUs don't sit idle


#PBS -q VisionLanguage
#PBS -l nodes=1:ppn=12
#PBS -l walltime=04:00:00

cd ~/workdir/
# Run 12 jobs in the background
for idx in {1..12}; do
  ./script ${idx} &   # Your job goes here (keep the ampersand)
  pid[idx]=$!         # Record the PID
done

# Wait for all the processes to finish
for idx in {1..12}; do
  echo waiting on ${pid[idx]}
  wait ${pid[idx]}
done

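A slightly simpler variant, for reference: a bare `wait` with no arguments blocks until every background child has exited, so the per-PID bookkeeping above is only needed if you want the progress messages:

cd ~/workdir/
for idx in {1..12}; do
  ./script ${idx} &
done
wait   # returns once all twelve background processes have finished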


Matlab and The Cluster


Simple Matlab Sample

#PBS -q VisionLanguage
#PBS -l nodes=1:ppn=12
#PBS -l walltime=04:00:00

cd ~/workdir/
matlab -nodisplay -r "matlab_func(); exit;"


Matlab Sample: Passing Parameters

#PBS -q VisionLanguage
#PBS -l nodes=1:ppn=12
#PBS -l walltime=04:00:00

cd ~/workdir/
param=1
param2=\'string\'   # Escape string parameters
matlab -nodisplay -r "matlab_func(${param}); exit;"
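To avoid hard-coding the values, the PBS_JOBNAME trick from the earlier slides can feed the Matlab call; a rough sketch, assuming the job is submitted as qsub -N job-3-foo and that matlab_func here takes a number and a string:

# Pass parameters via jobname (submitted as: qsub -N job-3-foo <this script>)
export IFS="-"
i=1
for word in ${PBS_JOBNAME}; do arr[i]=$word; ((i++)); done
unset IFS

cd ~/workdir/
matlab -nodisplay -r "matlab_func(${arr[2]}, '${arr[3]}'); exit;"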


Running more than a few Matlab jobs (e.g. thinking about using the secondary queue)?

You may use up too many licenses, especially for the Distributed Computing Toolbox (e.g. parfor)


Compiling Matlab Code

– Doesn't use any matlab licenses once compiled
– Compiles matlab code into a standalone executable

Constraints:
– Code can't call addpath
– Functions called by eval, str2func, or other implicit methods must be explicitly identified
• e.g. for eval('do_this') to work, must also include %#function do_this

To compile (within matlab):
>> addpath('everything that should be included')
>> mcc -m function_to_compile.m

isdeployed() is useful for modifying behavior for compiled applications(returns true if code is running the compiled version)
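For reference, one way to drive the compile step from the shell rather than an interactive Matlab session might look like this (the module version and the addpath target are placeholders):

module load matlab/7.14
matlab -nodisplay -r "addpath(fullfile(getenv('HOME'), 'workdir')); mcc -m function_to_compile.m; exit;"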


Running Compiled Matlab Code

• Requires the Matlab compiler runtime
>> mcrinstaller   % This will point you to the installer and help install it
                  % make note of the installed path MCRPATH (e.g. …/mcr/v716/)

• Compiled code generates two files:
– function_to_compile and run_function_to_compile.sh

• To run:
– [iendres2 ~]$ ./run_function_to_compile.sh MCRPATH param1 param2 … paramk
– Params will be passed into the matlab function as usual, except they will always be strings
– Useful trick:

function function_to_compile(param1, param2, …, paramk)
if(isdeployed)
  param1 = str2num(param1);   % param2 expects a string
  paramk = str2num(paramk);
end
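Putting it together, a job script that runs the compiled executable might look like this (the MCRPATH location and the parameter values are assumptions; both parameters reach Matlab as strings):

#PBS -q VisionLanguage
#PBS -l nodes=1:ppn=12
#PBS -l walltime=04:00:00

MCRPATH=${HOME}/mcr/v716   # wherever mcrinstaller put the runtime
cd ~/workdir/
./run_function_to_compile.sh ${MCRPATH} 3 some_string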


Parallel For Loops on the Cluster

• Not designed for multiple nodes on a shared filesystem:
– Race condition from concurrent writes to ~/.matlab/local_scheduler_data/

• Easy fix: redirect directory to /scratch.local


1. Setup (done once, before submitting jobs):
[iendres2 ~]$ ln -sv /scratch.local/tmp/USER/matlab/local_scheduler_data ~/.matlab/local_scheduler_data
(Replace USER with your netid)


2. Wrap the matlabpool function to make sure the tmp data exists:

function matlabpool_robust(varargin)

if(matlabpool('size')>0)
   matlabpool close
end

% make sure the directories exist and are empty for good measure
system('rm -rf /scratch.local/tmp/USER/matlab/local_scheduler_data');
system(sprintf('mkdir -p /scratch.local/tmp/USER/matlab/local_scheduler_data/R%s', version('-release')));

% Run it:
matlabpool(varargin{:});

Warning: /scratch.local may get filled up by other users, in which case this will fail.


Best Practices

• Interactive sessions
– Don't leave idle sessions open; it ties up the nodes

• Job arrays
– Still working on kinks in the scheduler; I managed to kill the whole cluster

• Disk I/O
– Minimize I/O for best performance
– Avoid small reads and writes due to metadata overhead


Maintenance

• “Preventive maintenance (PM) on the cluster is generally scheduled on a monthly basis on the third Wednesday of each month from 8 a.m. to 8 p.m. Central Time. The cluster will be returned to service earlier if maintenance is completed before schedule.”


Resources

• Beginner's guide: https://campuscluster.illinois.edu/user_info/doc/beginner.html

• More comprehensive user's guide: http://campuscluster.illinois.edu/user_info/doc/index.html

• Cluster monitor: http://clustat.ncsa.illinois.edu/taub/

• Simple sample job scripts: /projects/consult/pbs/

• Forum: https://campuscluster.illinois.edu/forum/