Advanced Odyssey Training
Paul Edmon FAS Research Computing
Harvard University
For copy of slides go to: http://software.rc.fas.harvard.edu/training
Outline
• Odyssey
• SLURM
• File Systems
• Best Practices
What is Odyssey?
• Harvard Supercomputer
• Original Odyssey: 4000 cores
• Odyssey 2: 28,000 cores
• Total Odyssey: 54,000 cores
What is Odyssey?
• Heterogeneous
• LSF (CentOS 5) vs. SLURM (CentOS 6)
• Owned vs. General Queues
• Located at 1 Summer Street, 60 Oxford Street, and Holyoke
Network
• 1 GbE Ethernet
• 10 GbE Ethernet
• InfiniBand
– QDR (Quad Data Rate): 32 Gb/s
– FDR-10 (Fourteen Data Rate): 41 Gb/s
– FDR: 54 Gb/s
Typical Odyssey Workflow
• Log in to a login node
• Set up runs for the cluster
– Make sure data is staged on storage. This can be done by submitting a job to transfer the data.
– Set up the job submission script with the proper specifications and queue.
• Submit the job to the cluster
• Wait until the job is finished
– The job will pend until space is found and the user has sufficient priority.
– The job will run against the storage specified.
• Move data to permanent storage or look at the results.
• WARNING: Do not run computationally, memory-, or I/O-intensive programs or scripts on the login nodes; use an interactive job instead. The login nodes are slower than, and not the same hardware as, the compute nodes. We also run a script that culls offending jobs.
Node Design
• Legacy Hardware
– Variety of chipsets, mostly Intel
– Being retired due to age
• Odyssey 2
– AMD Abu Dhabi
• Integer Cores (IC) vs. Floating Point Units (FPU)
• Supports AVX and FMA4
• Processor: 16 IC, 8 FPU
• 4 sockets, 256 GB per node
– 250 GB of local scratch at /scratch
– Chassis is made up of 8 blades
SLURM
• Simple Linux Utility for Resource Management
– Open source, managed by SchedMD
– Used by the largest supercomputers in the world
• Head node runs:
– slurmctld: control daemon
– slurmdbd: database daemon
• Client nodes run:
– slurmd: client daemon
• Submission nodes do not run a daemon; they just read the configuration file.
• Users are limited to 10,000 jobs in the queue at a time; a maximum of 150,000 jobs is permitted across all users.
Available Queues
• general
– 7 day time limit
– 20,032 cores (313 nodes) available for use
– 256 GB per node
– FDR-10 IB interconnect
– holyscratch available via IB
• interact
– 3 day time limit
– Interactive jobs
– 128 cores (2 nodes) available for use
– Otherwise same as general
Available Queues
• unrestricted
– No time limit
– Non-exclusive jobs
– 512 cores (8 nodes) available for use
– Otherwise same as general
• serial_requeue
– 1 day time limit
– Jobs may be requeued
– Jobs restricted to a single node
– 15,952 cores (431 nodes) available for use
– Variety of hardware and chipsets; some GPU nodes and some 512 GB nodes are available
• Group Specific
Scheduling Policy
• Fairshare
– Done on a per-user basis
– Fairshare half-life is 2 days
– Shares are equal for all users
– New fairshare structure will be coming in the future
– Higher fairshare means higher priority
• Time in Queue
– Priority scales linearly with time in the queue, up to a maximum at 7 days.
• Total Priority
– P = p*1E9 + f*2E8 + t*1E3, where p is partition priority, f is fairshare, t is time in queue, and P is total priority
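As a worked example, the formula above can be evaluated for hypothetical factor values (the real p, f, and t come from the scheduler; the numbers here are only for illustration):

```shell
# Worked example of P = p*1E9 + f*2E8 + t*1E3 with hypothetical
# factors: p = 1 (partition), f = 0.5 (fairshare), t = 0.25 (queue time).
P=$(awk 'BEGIN { p = 1; f = 0.5; t = 0.25; printf "%.0f", p*1e9 + f*2e8 + t*1e3 }')
echo "Total priority: $P"
```

Note how the weights enforce a strict ordering: partition dominates fairshare, and fairshare dominates time in queue.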
Backfill
• Starts with highest priority jobs and tries to wedge them in as it is able
• Use -t, --mem, and -n to help the scheduler sneak in your jobs.
• Low priority jobs can sneak in if they can backfill.
[Figure: backfill schematic comparing the pending queue and the cluster, with jobs A and B plotted against time and priority.]
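A sketch of a backfill-friendly submission: the tighter the requested time, memory, and task count, the easier it is for the scheduler to wedge the job into a gap. The queue is from the slides above; the program name and resource numbers are hypothetical.

```shell
#!/bin/bash
#SBATCH -p general        # queue
#SBATCH -n 1              # one task: small jobs backfill more easily
#SBATCH -t 30             # 30 minutes: a tight but honest time limit
#SBATCH --mem=500         # 500 MB: request only what the job needs
./my_program              # hypothetical; stands in for your real work
```

Requesting far more time or memory than the job uses makes the job harder to backfill and lengthens your wait.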
Useful Slurm Commands
• sinfo: Shows you information on partitions that are available.
• squeue: Shows queue information on jobs currently running.
• showq-slurm: Shows current queue state.
• sacct: Shows information about jobs that you have run in the past.
• sshare: Shows current fairshare.
• sprio: Shows current job priority.
• sdiag: Shows current scheduler state.
• sbatch: Submit batch jobs.
• srun: Used for interactive jobs.
• scancel: Cancels jobs.
Interactive
• srun -n 1 --pty --x11=first -p interact --mem=1000 bash
• Interactive mode works for any queue, but interact is set aside for dedicated interactive use such as:
– Debugging
– GUIs
– Visualization
• This command can be run from the login nodes.
Job Arrays
#!/bin/bash
#SBATCH -n 1
#SBATCH --array=1-10
#SBATCH -p computefest
#SBATCH -t 10
#SBATCH --mem=100
echo $SLURM_ARRAY_TASK_ID
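Each array task can use SLURM_ARRAY_TASK_ID to pick its own input. A minimal sketch of that mapping (the file-name pattern is hypothetical, and the fallback value just lets the snippet run outside SLURM):

```shell
#!/bin/bash
# SLURM exports SLURM_ARRAY_TASK_ID inside an array job; default it
# here so the mapping can be tried outside the scheduler.
TASK_ID=${SLURM_ARRAY_TASK_ID:-3}
INPUT="data_${TASK_ID}.txt"   # hypothetical per-task input file
echo "task $TASK_ID will read $INPUT"
```

With --array=1-10, ten copies of the script run, each seeing a different task ID and therefore a different input file.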
Reservations
• Sets aside a block of compute on a temporary basis
• Usage of reservation is restricted to defined users and counts against fairshare
• Can be set to be recurring; good for on-demand compute or debugging
• The more lead time we have to set up the reservation, the better; reservations require approval by RC management
Other SBATCH Options
• --contiguous – Ensures that the nodes you receive are adjacent
• --dependency=(see manual for options)
– Will delay job execution until a previous job completes
• --distribution=(see manual for options)
– Allows you to control how the ranks are distributed over the nodes you receive
• --ntasks-per-node
– Specifies how many tasks you want to land on each node
• --nodelist=nodelist, --exclude=nodelist
– Allows you to specify particular nodes to include in or exclude from the run
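A sketch combining two of these options: a 16-rank job spread evenly over two adjacent nodes, submitted to run only after an earlier job finishes. The program name, limits, and job ID are illustrative.

```shell
#!/bin/bash
#SBATCH -n 16                 # 16 tasks total
#SBATCH --ntasks-per-node=8   # land 8 tasks on each of 2 nodes
#SBATCH --contiguous          # ask for adjacent nodes
#SBATCH -t 60
#SBATCH --mem=4000
srun ./my_mpi_program         # hypothetical parallel program
```

Submitted as `sbatch --dependency=afterok:12345 job.sh`, the job would additionally wait for hypothetical job 12345 to complete successfully before becoming eligible to run.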
SLURM Best Practices
• Submit Test Jobs
• Use --mem, see: https://rc.fas.harvard.edu/kb/high-performance-computing/slurm-memory/
• Use -t
• Have checkpoints
• Do not submit tons of jobs
• Package your jobs efficiently
• If you are running threads, use --ntasks-per-node to make sure all the cores end up on the same node.
• Try to find the best way to get through the queue. We want you to game the scheduler.
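"Package your jobs efficiently" means bundling many short tasks into one submission rather than flooding the queue with thousands of tiny jobs. A minimal local sketch of the pattern (the input files and the per-task "work" are stand-ins):

```shell
#!/bin/bash
# Create a few stand-in input files, then process them all inside
# what would be a single job, instead of one job per file.
workdir=$(mktemp -d)
for i in 1 2 3; do
    echo "sample input $i" > "$workdir/case_$i.txt"
done

count=0
for f in "$workdir"/case_*.txt; do
    wc -w < "$f" > "${f%.txt}.out"   # stand-in for the real per-task work
    count=$((count + 1))
done
echo "processed $count tasks in one job"
```

One job looping over many inputs costs the scheduler a single dispatch; the same work as separate jobs would each pay queueing and scheduling overhead.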
SLURM Best Practices
• If things go wrong or break, try the following:
– Did you request enough memory, processors, or time?
– Did your code fail due to running out of filesystem space?
– Did you properly debug your code?
– Try solving the problem yourself before contacting us. Odds are the failure is on your end, not with the cluster or scheduler; you know your specific use case and code better than we do.
– Check the current queue state if your job is pending. There may be jobs ahead of you or no space currently available on the cluster.
SLURM Best Practices
• If you have a problem with a run that you can't figure out, send us:
– Job ID and username
– Job script
– Any filesystems you were using for the run, and what they were used for
– Modules used
– Queue used
– Any output and error files that were produced
– Any general information about the code or the failure that you think would be relevant, and any debugging you have already done
– Be as specific as possible. We are not mind readers; the more details you can provide, the better.
Filesystems
• Tiered Filesystem Structure
• Owned Storage
• Home
• Scratch – Local vs. Global
• Three Different Types of Remote Storage – NFS
– Gluster
– Lustre
NFS
• Network File System
• Not designed for high I/O traffic
• Can take about 40 concurrent writes/reads before the system starts to slow down significantly
• High loads can cause NFS to lock up and crash
• Good for low impact usage and looking around
• Most filesystems on Odyssey are mounted via NFS
Gluster
• GNU cluster
• Distributed File System
• Distributed Hash Table
• Good for moderate production I/O
• holyscratch
Lustre
• Distributed Filesystem
• Meta-data and Objects
• MDS: Meta-data Server
• OSS: Object Storage Server
– OST: Object Storage Target
• Good for high performance I/O
• scratch2 and regal
– Access restricted on an as-needed basis
Example I/O Patterns
• Multiple Nodes writing or reading concurrently: holyscratch or regal
• I/O bounded compute on files less than 250 GB in size: /scratch
• I/O bounded compute on files greater than 250 GB in size: Local storage or regal
Filesystem Best Practices
• Always think about what filesystem best suits your I/O needs
• Do not use Home or NFS filesystems for intensive I/O. Namely, do not point thousands of jobs at those filesystems.
• Remember scratch areas are for temporary storage only. They are not backed up. Also most have retention policies.
• Use local scratch on nodes for best performance
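The local-scratch pattern in practice: stage input to node-local storage, compute there, and copy results back at the end. In a real job the work area would be /scratch on the compute node; here temp directories stand in for both the permanent storage and /scratch so the sketch can run anywhere, and the file names are hypothetical.

```shell
#!/bin/bash
# Stage in -> compute locally -> stage out. On Odyssey the work area
# would live under /scratch on the compute node; mktemp stands in here.
src=$(mktemp -d)                      # stands in for permanent storage
echo "input data" > "$src/input.txt"

work=$(mktemp -d)                     # stands in for local /scratch
cp "$src/input.txt" "$work/"          # stage in

tr a-z A-Z < "$work/input.txt" > "$work/result.txt"   # stand-in compute

cp "$work/result.txt" "$src/"         # stage out to permanent storage
rm -rf "$work"                        # clean up local scratch
cat "$src/result.txt"
```

The heavy I/O all hits the node-local disk; the shared filesystem only sees one copy in and one copy out.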
Filesystem Best Practices
• Always store redundant copies of critical data. Super critical data should go to home.
• Nothing on the cluster should be considered safe forever. The cluster is not a long term storage solution. Once you are done with your science store the data in a more permanent location or be able to regenerate the data.
• Data should always be reproducible (good science always is). Data that is not should be backed up.
• Often technology improves to the point where it is faster to just regenerate the data (if synthetic) than to store it so always weigh which data you actually need to keep.
• Symlinks are useful, but don’t overuse them. Try to code using relative paths and avoid using symlinks where possible. They can obfuscate problems.
Filesystem Best Practices
• If something goes wrong, as with SLURM, try to debug it yourself first.
– Filesystem hang: contact RC, as the filesystem in question is either under high load or has fallen over
– File errors: possible corruption or filesystem issue; contact RC
– Out of space: delete data, compress data, or move it to a new location
– Out of inodes: delete data, tar data, or move data to a new location
– Read/write errors: could be the filesystem or your program; verify that your code is working before contacting RC
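For the out-of-inodes case, `df -i` reports inode usage, and tarring a directory of many small files reclaims inodes. A runnable sketch with stand-in files:

```shell
#!/bin/bash
df -i . | tail -1                    # inode usage on the current filesystem

# Consolidate many small files into a single archive, then remove them.
dir=$(mktemp -d)
for i in $(seq 1 100); do
    echo "$i" > "$dir/tiny_$i.txt"   # 100 small files = 100 inodes
done
tar -czf "$dir.tar.gz" -C "$(dirname "$dir")" "$(basename "$dir")"
rm -rf "$dir"                        # 100 inodes freed; 1 archive remains
ls -l "$dir.tar.gz"
```

Each file costs an inode regardless of size, so directories of millions of tiny files can exhaust inodes long before the filesystem runs out of space.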
Filesystem Best Practices
• If you can't debug it yourself, contact us and provide the following:
– Username, node, and filesystem in question
– Full path to the data where the issues are occurring (the real path, not a symlinked path)
– A description of what you were doing and what the error was
– Any error or output logs you have
– Any debugging you have already done
– Be as specific as possible
Ganglia Monitoring
• We have our performance monitoring software outputting to a public website: – status.rc.fas.harvard.edu
• This is great for seeing if a filesystem you are using is overloaded, or what the state of the cluster is, or if a node you are running on is having issues.
• It’s a good first diagnostic for many problems related to cluster load besides looking at the SLURM queue.
General Cluster Best Practice
• Always, always, always test your code
• Always, always, always check your answers
• Always, always, always think before you submit to the cluster
• Always, always, always try to debug your problems yourself before contacting RC.
General Cluster Best Practice
• Try to use the minimum amount of compute needed
• Try to have checkpoints
• Use appropriate filesystems for I/O pattern
• Minimize the number of jobs submitted to the scheduler
• Be cognizant of where you are in the queue and leverage backfill
• Try to game the queue. We want you to find the best way to run for you so experiment and figure out how best to push your jobs through.