Transcript of Advanced Odyssey Training

Page 1:

Advanced Odyssey Training

Paul Edmon FAS Research Computing

Harvard University

For a copy of the slides go to: http://software.rc.fas.harvard.edu/training

Page 2:

Outline

• Odyssey

• SLURM

• File Systems

• Best Practices

Page 3:

What is Odyssey?

• Harvard Supercomputer

• Original Odyssey: 4,000 cores

• Odyssey 2: 28,000 cores

• Total Odyssey: 54,000 cores

Page 4:

What is Odyssey?

• Heterogeneous

• LSF (CentOS 5) vs. SLURM (CentOS 6)

• Owned vs. General Queues

• Three data centers: 1 Summer Street, 60 Oxford Street, and Holyoke

Page 5:

Network

• 1 GbE Ethernet

• 10 GbE Ethernet

• Infiniband

– QDR (Quad Data Rate): 32 Gb/s

– FDR-10 (Fourteen Data Rate): 41 Gb/s

– FDR: 54 Gb/s

Page 6:

Typical Odyssey Workflow

• Login to login node

• Set up runs for the cluster

– Make sure data is staged on storage. This can be done by submitting a job to transfer the data.

– Set up job submission script to proper specifications and queue.

• Submit Job to Cluster

• Wait until the job is finished

– Job will pend until space is found and the user has sufficient priority

– Job will run against the storage specified

• Move data to permanent storage or look at the results.

• WARNING: Do not run computationally, memory-, or I/O-intensive programs or scripts on the login nodes; use an interactive session instead. The login nodes are slow and not the same hardware as the compute nodes. We also have a script that culls offending jobs.
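As a concrete illustration of this workflow, below is a minimal sketch of a submission script and the commands around it; the script name, directory, and program (run_analysis.sbatch, /n/my_lab/run1, my_program) are hypothetical placeholders, not real Odyssey paths.

#!/bin/bash
# run_analysis.sbatch -- hypothetical example submission script
#SBATCH -p general          # queue to submit to
#SBATCH -n 1                # number of tasks
#SBATCH -t 60               # time limit in minutes
#SBATCH --mem=4000          # memory in MB
#SBATCH -o myjob_%j.out     # stdout file (%j expands to the job ID)

cd /n/my_lab/run1           # hypothetical directory where the data is staged
./my_program input.dat      # hypothetical program

# From a login node:
#   sbatch run_analysis.sbatch   (submit the job)
#   squeue -u $USER              (watch it pend, then run)
# When it finishes, move the results to permanent storage.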

Page 7:

Node Design

• Legacy Hardware

– Variety of chipsets, mostly Intel

– Being retired due to age

• Odyssey 2

– AMD Abu Dhabi

• Integer Cores (IC) vs. Floating Point Units (FPU)

• Supports AVX and FMA4

• Processor: 16 IC, 8 FPU

• 4 sockets, 256 GB per node

– 250 GB of local scratch at /scratch

– Chassis is made up of 8 blades

Page 8:

SLURM

• Simple Linux Utility for Resource Management

– Open source, maintained by SchedMD

– Used by the largest supercomputers in the world

• Head node runs:

– slurmctld: control daemon

– slurmdbd: database daemon

• Client nodes run:

– slurmd: client daemon

• Submission nodes do not run a daemon but rather just read the configuration file.

• Users are limited to 10,000 jobs in the queue at a time; a maximum of 150,000 jobs is permitted across all users.
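A quick way to see these pieces from a submission node is to query the controller; a minimal sketch using standard SLURM commands (the exact parameter names returned depend on the site configuration):

scontrol ping                            # check that the slurmctld control daemon is responding
scontrol show config | grep -i maxjob    # job-count limits (e.g. MaxJobCount)
squeue -u $USER | wc -l                  # rough count of your jobs currently in the queue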

Page 9:

Available Queues

Page 10:

Available Queues

• general

– 7 day time limit

– 20,032 cores (313 nodes) available for use

– 256 GB per node

– FDR-10 IB interconnect

– holyscratch available via IB

• interact

– 3 day time limit

– Interactive jobs

– 128 cores (2 nodes) available for use

– Otherwise same as general

Page 11:

Available Queues

• unrestricted

– No time limit

– Non-exclusive jobs

– 512 cores (8 nodes) available for use

– Otherwise same as general

• serial_requeue

– 1 day time limit

– Requeue (jobs may be preempted and requeued)

– Jobs restricted to a single node

– 15,952 cores (431 nodes) available for use

– Variety of hardware and chipsets; some GPU nodes and some 512 GB nodes are available.

• Group-specific queues
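Queue limits and node counts change over time, so rather than relying on the numbers above you can query SLURM directly; a minimal sketch:

sinfo -p general,interact,unrestricted,serial_requeue   # current state of each partition
scontrol show partition general                         # time limit, node list, and other settings
sinfo -p serial_requeue -o "%20N %10c %10m %25f"        # node names, cores, memory, and features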

Page 12:

Scheduling Policy

• Fairshare

– Done on a per-user basis

– Fairshare half-life is 2 days

– Shares are equal for all users

– A new fairshare structure will be coming in the future

– Higher fairshare means higher priority

• Time in Queue

– Priority scales linearly with time in the queue and reaches its maximum at 7 days.

• Total Priority

– P = p*1E9 + f*2E8 + t*1E3, where P is the total priority, p is the partition priority factor, f is the fairshare factor, and t is the time-in-queue factor
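You can see how these weights and factors are applied to your own pending jobs with sprio; a minimal sketch (the job ID is a hypothetical placeholder):

sprio -w             # show the configured weights (partition, fairshare, age, ...)
sprio -u $USER       # per-job priority factors for your pending jobs
sprio -n -j 1234567  # normalized (0-1) factors for a single job; 1234567 is a placeholder ID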

Page 13:

Backfill

• Starts with highest priority jobs and tries to wedge them in as it is able

• Use -t, --mem, and -n to sneak in jobs (see the sketch below).

• Low priority jobs can sneak in if they can backfill.

(Diagram: jobs A and B in the pending queue, ordered by priority and requested time, being fit into open slots on the cluster.)
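A minimal sketch of making a job backfill-friendly: keep the resource request tight and accurate, and check the scheduler's estimated start times. The script name and values are hypothetical.

sbatch -t 30 --mem=2000 -n 1 short_task.sbatch   # a tight, accurate request is easier to backfill
squeue -u $USER --start                          # estimated start times for your pending jobs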

Page 14:

Useful Slurm Commands

• sinfo: Shows you information on partitions that are available.

• squeue: Shows queue information on jobs currently running or pending.

• showq-slurm: Shows current queue state.

• sacct: Shows information about jobs that you have run in the past.

• sshare: Shows current fairshare.

• sprio: Shows current job priority.

• sdiag: Shows current scheduler state.

• sbatch: Submit batch jobs.

• srun: Used for interactive jobs.

• scancel: Cancels jobs.
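A few of these commands as they are typically used; a minimal sketch (the job ID shown is a hypothetical placeholder):

sinfo -p general                                       # partition and node states
squeue -u $USER                                        # your running and pending jobs
sacct -j 1234567 --format=JobID,State,Elapsed,MaxRSS   # accounting for a past job
sshare -u $USER                                        # your current fairshare
sprio -u $USER                                         # priority of your pending jobs
scancel 1234567                                        # cancel a job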

Page 15:

Interactive

• srun -n 1 --pty --x11=first -p interact --mem=1000 bash

• Interactive mode works for any queue, but interact is set aside for dedicated interactive use such as:

– Debugging

– GUI

– Visualization

• This command can be run from the login nodes.

Page 16:

Job Arrays

#!/bin/bash
#SBATCH -n 1              # one task per array element
#SBATCH --array=1-10      # array indices 1 through 10
#SBATCH -p computefest    # partition (queue) to submit to
#SBATCH -t 10             # time limit in minutes
#SBATCH --mem=100         # memory in MB

echo $SLURM_ARRAY_TASK_ID # each array element prints its own index
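To submit the array and give each element its own input and output, something like the following works; the file names are hypothetical examples:

sbatch array_job.sbatch   # submits all 10 array elements at once
# Inside the script, use the index to select an input, e.g.:
#   ./my_program input_${SLURM_ARRAY_TASK_ID}.dat
# To keep outputs separate, use %A (array job ID) and %a (array index):
#   #SBATCH -o array_%A_%a.out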

Page 17:

Reservations

• Sets aside a block of compute on a temporary basis

• Use of a reservation is restricted to defined users and counts against fairshare

• Can be set to be recurring and is good for on-demand compute or debugging

• The more lead time we have to set up the reservation the better; reservations require approval by RC management
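Once RC has created a reservation, jobs opt into it with the --reservation flag; a minimal sketch (the reservation name is a hypothetical placeholder):

scontrol show reservation                      # list active reservations and who may use them
sbatch --reservation=my_lab_debug job.sbatch   # submit a batch job into the reservation
srun --reservation=my_lab_debug -p interact -n 1 --pty --mem=1000 bash   # interactive use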

Page 18:

Other SBATCH Options

• --contiguous

– Ensures that the nodes you receive are adjacent

• --dependency=(see manual for options)

– Will delay job execution until a previous job completes

• --distribution=(see manual for options)

– Allows you to control how the ranks are distributed over the nodes you receive

• --ntasks-per-node

– Specifies how many tasks you want to land on each node

• --nodelist=nodelist, --exclude=nodelist

– Allows you to specify a particular node to include or exclude from the run
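A sketch of how several of these options can look together in a job header; the dependency job ID, excluded node name, and program are hypothetical placeholders:

#!/bin/bash
#SBATCH -p general
#SBATCH -N 2                           # two nodes
#SBATCH --ntasks-per-node=16           # place 16 tasks on each node
#SBATCH --contiguous                   # ask for adjacent nodes
#SBATCH --dependency=afterok:1234567   # start only after job 1234567 finishes successfully
#SBATCH --exclude=somenode01           # hypothetical node to avoid
#SBATCH --distribution=block           # fill one node before moving to the next

srun ./my_mpi_program                  # hypothetical MPI executable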

Page 19:

SLURM Best Practices

• Submit Test Jobs

• Use --mem, see: https://rc.fas.harvard.edu/kb/high-performance-computing/slurm-memory/

• Use -t

• Have checkpoints

• Do not submit tons of jobs

• Package your jobs efficiently (one approach is sketched after this list)

• If you are running threads, use --ntasks-per-node to make sure all the cores end up on the same node.

• Try to find the best way to get through the queue. We want you to game the scheduler.
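One way to package jobs efficiently is to process many small inputs inside a single job rather than submitting one job per file; a minimal sketch with hypothetical file and program names:

#!/bin/bash
#SBATCH -p general
#SBATCH -n 1
#SBATCH -t 120
#SBATCH --mem=2000

# Loop over many small inputs serially within one job
# instead of submitting each as its own job.
for f in input_*.dat; do
    ./my_program "$f" > "${f%.dat}.out"
done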

Page 20:

SLURM Best Practices

• If things go wrong or break, check the following:

– Did you request enough memory, processors, or time?

– Did your code fail due to running out of filesystem space?

– Did you properly debug your code?

– Try solving the problem yourself prior to contacting us. Odds are the failure is on your end, not with the cluster or scheduler. You know your specific use case and code better than we do.

– Check the current queue state if your job is pending. There may be jobs ahead of you or no current space on the cluster.
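When a job fails or sits pending, the scheduler itself usually tells you why; a minimal sketch of the first things to check (the job ID is a hypothetical placeholder):

sacct -j 1234567 --format=JobID,State,ExitCode,Elapsed,ReqMem,MaxRSS   # out of memory or time?
squeue -j 1234567 -o "%.18i %.9P %.8T %.10M %R"                        # %R shows the pending reason
sshare -u $USER                                                        # is your fairshare low?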

Page 21:

SLURM Best Practices

• If you have a problem with a run that you can't figure out, send us:

– Job ID and username

– Job Script

– Any filesystems you were using for the run and what they were used for.

– Modules used

– Queue used

– Any output and error files that were produced

– Any general information about the code or about the failure that you think would be relevant, and any debugging you have already done.

– Be as specific as possible. We are not mind readers; the more details you can provide, the better.

Page 22:

Filesystems

• Tiered Filesystem Structure

• Owned Storage

• Home

• Scratch – Local vs. Global

• Three Different Types of Remote Storage

– NFS

– Gluster

– Lustre

Page 23:

NFS

• Network File System

• Not designed for high I/O traffic

• Can take about 40 concurrent writes/reads before the system starts to slow down significantly

• High loads can cause NFS to lock up and crash

• Good for low impact usage and looking around

• Most filesystems on Odyssey are mounted via NFS

Page 24:

Gluster

• GNU cluster

• Distributed File System

• Distributed Hash Table

• Good for moderate production I/O

• holyscratch

Page 25:

Lustre

• Distributed Filesystem

• Meta-data and Objects

• MDS: Meta-data Server

• OSS: Object Storage Server

– OST: Object Storage Target

• Good for high performance I/O

• scratch2 and regal

– Restricted to an as-needed basis

Page 26:

Example I/O Patterns

• Multiple Nodes writing or reading concurrently: holyscratch or regal

• I/O-bound compute on files less than 250 GB in size: /scratch

• I/O-bound compute on files greater than 250 GB in size: local storage or regal
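A common pattern for I/O-bound work on node-local /scratch is to copy inputs in at the start of the job, compute against the local disk, then copy results back out; a minimal sketch with hypothetical paths and program names:

#!/bin/bash
#SBATCH -p general
#SBATCH -n 1
#SBATCH -t 240
#SBATCH --mem=4000

WORKDIR=/scratch/$USER/$SLURM_JOB_ID     # node-local scratch; hypothetical layout
mkdir -p "$WORKDIR"
cp /n/my_lab/data/input.dat "$WORKDIR"/  # stage in from networked storage (hypothetical path)
cd "$WORKDIR"
./my_program input.dat                   # I/O now hits the fast local disk
cp output.dat /n/my_lab/results/         # stage results back out
rm -rf "$WORKDIR"                        # clean up local scratch when done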

Page 27:

Filesystem Best Practices

• Always think about what filesystem best suits your I/O needs

• Do not use home or other NFS filesystems for intensive I/O; in particular, do not point thousands of jobs at those filesystems.

• Remember scratch areas are for temporary storage only. They are not backed up. Also most have retention policies.

• Use local scratch on nodes for best performance

Page 28:

Filesystem Best Practices

• Always store redundant copies of critical data. Super critical data should go to home.

• Nothing on the cluster should be considered safe forever. The cluster is not a long term storage solution. Once you are done with your science store the data in a more permanent location or be able to regenerate the data.

• Data should always be reproducible (good science always is). Data that is not should be backed up.

• Often technology improves to the point where it is faster to just regenerate the data (if synthetic) than to store it so always weigh which data you actually need to keep.

• Symlinks are useful, but don’t overuse them. Try to code using relative paths and avoid using symlinks where possible. They can obfuscate problems.

Page 29:

Filesystem Best Practices

• If something goes wrong, as with SLURM, try to debug it yourself first.

– Filesystem hang: contact RC, as the filesystem in question is either under high load or has fallen over

– File errors: possible corruption or a filesystem issue; contact RC

– Out of space: Delete data, compress data, or move to new location

– Out of inodes: Delete data, tar data, or move data to new location

– Read/write errors: could be the filesystem or the program; verify that your code is working prior to contacting RC

Page 30:

Filesystem Best Practices

• If you can't debug it yourself, contact us and provide the following:

– Username, node, and filesystem in question

– Full path to the data where the issues are occurring, not the symlinked path if symlinks are involved

– A description of what you were doing and what the error was

– Any error or output logs you have

– Any debugging you have already done

– Be as specific as possible

Page 31:

Ganglia Monitoring

• We have our performance-monitoring software outputting to a public website:

– status.rc.fas.harvard.edu

• This is great for seeing if a filesystem you are using is overloaded, or what the state of the cluster is, or if a node you are running on is having issues.

• It’s a good first diagnostic for many problems related to cluster load besides looking at the SLURM queue.

Page 32:

General Cluster Best Practice

• Always, always, always test your code

• Always, always, always check your answers

• Always, always, always think before you submit to the cluster

• Always, always, always try to debug your problems yourself before contacting RC.

Page 33:

General Cluster Best Practice

• Try to use the minimum amount of compute needed

• Try to have checkpoints

• Use appropriate filesystems for I/O pattern

• Minimize the number of jobs submitted to the scheduler

• Be cognizant of where you are in the queue and leverage backfill

• Try to game the queue. We want you to find the best way to run for you so experiment and figure out how best to push your jobs through.

Page 34:

Questions, Comments, Concerns?

Contact [email protected] for help.