
HPC Workshop: Working on Soroban

Dr. L. Bennett

ZEDAT, FU Berlin

SS 2015


Outline

1 Introduction: Goals, Resources

2 Before Running Jobs: Preparation, Batch System

3 While Running Jobs: Queue, Processes

4 After Running Jobs: Resources Used, Tidying Up


Introduction Goals

Overview

Your goals

Complete a certain task in a certain time

That's (probably) it!

Our goals

Manage resources such that the largest number of people can achieve their goals

Provide resources

Help users make good use of resources


Introduction Resources

Soroban


Introduction Resources

Overview

Limited resources

cores

memory

disk-space

licences

Intel compiler

MATLAB / MATLAB toolboxes

graphics processing units (GPUs)

Unlimited resources

software


Introduction Resources

Main Limited & Consumable Resources

Cores

1344 cores in total

12 cores per node

Memory

5.25 TB in total

24, 48 or 96 GB per node

18, 42 or 90 GB available for users (6 GB reserved for OS)


Introduction Resources

Main Limited & Nonconsumable Resource

Disk space

local disks

none

distributed file systems

16 TB /home

174 TB /scratch

Limits

total size of the file systems

how well the admins monitor usage


Introduction Resources

Bottlenecks

Comparison of resources

Resource     Bottleneck?    Comment
cores        sometimes      many cores often unused, but few on individual nodes
memory       often          users often overestimate
disk space   not usually    we keep an eye on disk usage
disk access  occasionally   IO may cross critical threshold


Introduction Resources

File Systems

/home

16 TB

NFS

backup (except tmp, temp)

scripts

results

high metadata performance

good for reading/writing many files

/scratch

174 TB

FhGFS

no backup

temporary data

copies of input data

high I/O performance

good for reading/writing large files


Introduction Resources

Useful commands

$ sinfo -eo "%30N %.5D %9P %11T %.6m %20E" or $ sinfo -Nl

NODELIST NODES PARTITION STATE CPUS MEMORY REASON

gpu01 1 gpu allocated 12 18000 none

node[001-002] 2 test allocated 12 42000 none

node003 1 main* idle 12 18000 none

node[004-024] 19 main* allocated 12 18000 none

node[025-034,036-100] 70 main* allocated 12 42000 none

node035 6 main* draining 12 42000 large uptime

node[101-111] 11 main* allocated 12 90000 none

node112 1 main* drained 12 90000 large uptime

$ df -h /home /scratch

Filesystem Size Used Avail Use% Mounted on

master.ib.cluster:/home 17T 13T 3.7T 78% /home

fhgfs_nodev 164T 111T 54T 68% /scratch
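To see how busy individual nodes are (for example, whether free cores are scattered over many nodes), standard sinfo format fields can also report per-node CPU and memory state; this is a sketch using documented sinfo options, not a command from the slides:

$ sinfo -N -o "%n %T %C %m"
# %n  node hostname
# %T  node state
# %C  CPUs as allocated/idle/other/total
# %m  memory per node in MB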


Introduction Resources

Software

Software can be . . .

directly in the operating system, e.g.

Perl

Python

provided centrally via module (see next slides) , e.g.

GROMACS

NAMD

in your /home directory

before you install here, check whether the software is already available

all three of the above, e.g.

R

not yet installed

we can install software centrally for you (others may also need it)


Introduction Resources

Modules I

$ module av

...

gromacs/openmpi/gcc/64/4.5.4 stampy/1.0.23

gromacs/openmpi/gcc/64/4.5.5 symmetree/1.1

hpl/2.0 tophat/2.0.13

iozone/3_373 trinity/20140413p1

java/1.8.0 turbomole/6.5(default)

lammps/11Jan12-openmpi turbomole/6.5mpi

mafft/7.205 turbomole/6.5smp

matlab/R2011b vasp/5.2.12(default)

matlab/R2012b vasp/5.3.3

matlab/R2014a vesta/3

matlab/R2014b vmd/1.9.1

migrate-n/3.6 wien2k/11_32bit

mira/3.4.0 wien2k/11_64bit

$ module av fsl

------------------------ /cm/shared/modulefiles/production ---------------------

fsl/4.1.9 fsl/5.0.0 fsl/5.0.1 fsl/5.0.7


Introduction Resources

Modules II

$ module help fsl

----------- Module Specific Help for 'fsl/5.0.7' ------------------

PROGRAM

FSL 5.0.7 - FMRIB Software Library

EXECUTABLES

This package encompasses a large number of executables. Please refer to the

documentation:

http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FslOverview

NOTES

Some executables may support multithreading.


Introduction Resources

Modules III

$ module show fsl

-------------------------------------------------------------------

/cm/shared/modulefiles/production/fsl/5.0.7:

module-whatis FSL 5.0.7 - FMRIB Software Library

append-path PATH /cm/shared/apps/fsl/5.0.7/bin

setenv FSLDIR /cm/shared/apps/fsl/5.0.7

setenv FSLOUTPUTTYPE NIFTI_GZ

setenv FSLMULTIFILEQUIT TRUE

setenv FSLTCLSH /cm/shared/apps/fsl/5.0.7/bin/fsltclsh

setenv FSLWISH /cm/shared/apps/fsl/5.0.7/bin/fslwish

setenv FSLCONFDIR /cm/shared/apps/fsl/5.0.7/config

setenv FSLMACHTYPE `/cm/shared/apps/fsl/5.0.7/etc/fslconf/fslmachtype.sh`

-------------------------------------------------------------------

$ module add fsl

$ module rm fsl

$
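A typical module workflow, sketched with one of the packages from the listing above (the version string is just an example taken from that list):

$ module av matlab          # list available MATLAB versions
$ module add matlab/R2014b  # load a specific version
$ module list               # show currently loaded modules
$ module rm matlab/R2014b   # unload it again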


Before Running Jobs Preparation

What to think about

Scope

What do I want to do?

How much time have I got?

Are sufficient resources (such as software, CPU time, memory and disk space) available?

Skills

Do I have general Unix skills?

Do I have program-specific skills?

Do I know where to get help?


Before Running Jobs Preparation

Getting Help

Our website http://www.zedat.fu-berlin.de/HPC/Home

is in English and German

has general information and some information about specific programs

People in your group

may already be familiar with Soroban

may already have done similar calculations

may know other people who can help

People here in the HPC group

can be reached via email ([email protected]) or telephone

are happy to talk to you about your project face-to-face


Before Running Jobs Batch System

Overview

Slurm

Simple Linux Utility for Resource Management

allocates resources for jobs

provides framework for starting and monitoring jobs

works out job priorities

Basic workflow

user submits job to Slurm via sbatch

Slurm calculates priorities for each job

Slurm starts jobs according to priority and available resources

(Slurm notifies user on job completion)


Before Running Jobs Batch System

Fairshare

Fairshare Factor

F = 2^(-U/S)

where

F fairshare factor

U normalised usage

S total normalised shares

Thus

U > S implies F < 0.5

U = S implies F = 0.5

U < S implies F > 0.5
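A worked example with made-up numbers: a group that has used twice its share gets

U = 2S: F = 2^(-2) = 0.25

while a group that has used only half of its share gets

U = S/2: F = 2^(-0.5) ≈ 0.71

so jobs from under-using groups are favoured over those from over-using groups.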


Before Running Jobs Batch System

Priority

Contributing factors

P = w_f * F + w_n * N_CPUs + w_a * A

where

F fairshare

N_CPUs percentage of CPUs requested

A age (time in queue)

w_i weighting factors

w_f = 1000000

w_n = 10000

w_a = 1000
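Plugging in made-up values (taking all three factors as values between 0 and 1, which is how Slurm normalises its priority factors): with F = 0.5, N_CPUs = 0.02 and A = 0.1,

P = 1000000 * 0.5 + 10000 * 0.02 + 1000 * 0.1 = 500300

so with these weights fairshare dominates, job size comes next, and queue age mainly breaks ties.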


Before Running Jobs Batch System

Main Parameters

Specification

Cores
--nodes=1-1          exactly one node
--nodes=2            at least 2 nodes
--nodes=2-12         at least 2 nodes, at most 12 nodes

Memory
--mem=10240          memory per node in MB
--mem-per-cpu=1024   memory per CPU in MB

Time
--time=00:30:00      maximum run-time (hr:min:sec)
--time=2-12:00:00    maximum run-time (days-hr:min:sec)
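A minimal sketch of how these options fit together at the top of a job script (the numbers are arbitrary examples, not recommendations):

#!/bin/bash
#SBATCH --nodes=1-1           # exactly one node
#SBATCH --ntasks=12           # up to one full node (12 cores per node)
#SBATCH --mem-per-cpu=1024    # 1 GB per core
#SBATCH --time=0-08:00:00     # at most 8 hours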


Before Running Jobs Batch System

Backfill

Node requirements

Mechanism

Job A is running

Job B can only run when Job A ends

Job C can start before Job B if it ends before Job A
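The more accurate (and therefore usually shorter) your --time request, the more likely your job can be backfilled into such a gap. A minimal sketch (the limit is only an example, not a recommendation):

#SBATCH --time=02:00:00   # realistic limit: lets the scheduler slot this job in
                          # ahead of larger jobs without delaying them

$ squeue --start -u fakeuser   # Slurm's current estimate of start times for pending jobs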


Before Running Jobs Batch System

Slurm Batch Script

Serial job

#!/bin/bash

#SBATCH --mail-user=[email protected]

#SBATCH --mail-type=end

#SBATCH --job-name=my_test_job_serial

#SBATCH --mem-per-cpu=2048

#SBATCH --time=08:00:00

cd /scratch/fakeuser/test

cp ~/input/test.input .

module add fakeprog

fakeprog -i test.input > test.out

cp test.out ~/results


Before Running Jobs Batch System

Slurm Batch Script II

Multithreaded job

#!/bin/bash

#SBATCH --mail-user=[email protected]

#SBATCH --mail-type=end

#SBATCH --job-name=my_test_job_multithreaded

#SBATCH --mem-per-cpu=2048

#SBATCH --time=04:00:00

#SBATCH --ntasks=6

#SBATCH --nodes=1-1

cd /scratch/fakeuser/test

cp ~/input/test.input .

module add fakeprog_mt

fakeprog_mt -n $SLURM_NTASKS -i test.input > test.out

cp test.out ~/results
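Depending on the program, a multithreaded job can also be expressed as one task with several CPUs; this is a generic sketch rather than a recipe from the slides, reusing the placeholder program fakeprog_mt from above:

#SBATCH --nodes=1-1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6

fakeprog_mt -n $SLURM_CPUS_PER_TASK -i test.input > test.out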


Before Running Jobs Batch System

Slurm Batch Script III

MPI parallel job

#!/bin/bash

#SBATCH --mail-user=[email protected]

#SBATCH --mail-type=end

#SBATCH --job-name=my_test_job_mpi

#SBATCH --mem-per-cpu=2048

#SBATCH --time=04:00:00

#SBATCH --ntasks=24

cd /scratch/fakeuser/test

cp ~/input/test.input .

module add fakeprog_mpi

fakeprog_mpi -n $SLURM_NTASKS -i test.input > test.out

cp test.out ~/results
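With a real MPI program the core count is usually not passed by hand; an MPI launcher that is integrated with Slurm reads the task count and node list from the allocation. A hedged sketch of the last step under that assumption (fakeprog_mpi is the placeholder program from above):

srun fakeprog_mpi -i test.input > test.out
# or, with an Open MPI build that has Slurm support:
mpirun fakeprog_mpi -i test.input > test.out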


Before Running Jobs Batch System

Useful commands

$ sbatch my_job.sh

Submitted batch job 25618

$ sprio -o "%.7i %8u %.10Y %.10A %.10F %.10J" | sort -nk3

JOBID USER PRIORITY AGE FAIRSHARE JOBSIZE

123496 alice 41861 24 31735 102

123468 bob 59108 21 49042 45

123467 carol 232189 1 222116 72

or $ sprio -l | sort -nk3


While Running Jobs Queue

Information on jobs

$ squeue -u fakeuser

JOBID PARTITION    NAME     USER ST       TIME NODES NODELIST(REASON)
123456      main test_01 fakeuser  R 7-02:49:44     4 node[019]
123457      main test_02 fakeuser  R 2-01:46:13     3 node[008-009,011]
123458      main test_03 fakeuser PD      00:00     1 (Priority)

$ scontrol show job 123456

JobId=123456 Name=test_01
UserId=fakeuser(111111) GroupId=agfake(999999)
Priority=823131 Account=agfake QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
RunTime=01:23:35 TimeLimit=2-12:00:00 TimeMin=N/A
SubmitTime=2015-03-18T06:40:35 EligibleTime=2015-03-17T16:03:28
StartTime=2015-03-18T06:40:37 EndTime=2015-03-21T05:40:37
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=main AllocNode:Sid=node005:98510
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node019
BatchHost=node019
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=2750M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/work/fakeuser/scaling_test/batch.sh
WorkDir=/scratch/fakeuser/scaling_test
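While a job is running, its resource usage so far can also be queried; a minimal sketch using standard Slurm commands and field names (job ID as in the example above):

$ sstat -j 123456.batch --format=JobID,MaxRSS,AveCPU     # memory and CPU of the running batch step
$ scontrol show job 123456 | grep -E 'RunTime|TimeLimit'  # just the time-related lines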


While Running Jobs Processes

Tools

$ ps -flHu loris

F S UID    PID   PPID  C PRI NI ADDR SZ    WCHAN  STIME TTY    TIME     CMD
5 S loris  23315 23309 0  80  0 -    31612 poll_s 10:28 ?      00:00:00 sshd: loris@pts/44
0 S loris  23316 23315 0  80  0 -    30621 wait   10:28 pts/44 00:00:00 -bash
0 R loris  15717 23316 0  80  0 -    29655 -      11:12 pts/44 00:00:00

$ htop -u loris
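top can provide a similar, non-interactive snapshot, which is handy for a quick check or for logging (standard top options, nothing Soroban-specific):

$ top -b -n 1 -u loris   # one batch-mode snapshot of loris's processes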


After Running Jobs Resources Used

Memory

$ cat slurm-123456.out

Fri Dec 20 11:02:11 CET 2013

-- Slurm Task Epilog ----------------------------------------------------------
node[040,042]:

JobID        Memory Used  Memory Requested  STATUS, Comment on Slurm Memory
------------ -----------  ----------------  -----------------------------------
123456       1084         6000              COMPLETED, Memory too high

JobID        MaxRSS    TotalCPU   Elapsed   NTasks NCPUS MaxRSSNode State     End
------------ --------- ---------- --------- ------ ----- ---------- --------- -------------------
123456                 41:22.170  00:10:25         4                COMPLETED 2013-12-20T11:01:40
123456.batch 6620K     00:00.184  00:10:25  1      1     node040    COMPLETED 2013-12-20T11:01:40
123456.0     1116196K  41:21.985  00:10:23  4      4     node040    COMPLETED 2013-12-20T11:01:40

SLURM_JOB_DERIVED_EC=0
SLURM_JOB_EXIT_CODE=0
COMPLETED

Why bother?

Job does not have to wait for memory it doesn't need

More memory available to other jobs (also yours!)
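After the job has finished, the same accounting data can be retrieved at any time; a minimal sketch using standard sacct format fields (job ID as in the epilog above):

$ sacct -j 123456 --format=JobID,ReqMem,MaxRSS,Elapsed,State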


After Running Jobs Resources Used

Time

Email about time limit

JobID   Limit Requested  RunTime Used  STATUS, Comment on Slurm TimeLimit
------  ---------------  ------------  ----------------------------------
613062  23:59:00         07:17         COMPLETED, Limit too high
613087  23:59:00         23:03         COMPLETED, Ok
613128  23:59:00         06:22         COMPLETED, Limit too high
613936  23:59:00         1-00:01:16    TIMEOUT(TimeLimit), Limit too low!
614003  05:59:00         04:45         CANCELLED(COMPLETED), Limit too high
614077  23:59:00         06:16         CANCELLED(COMPLETED), Limit too high
614096  23:59:00         15:22         COMPLETED, Limit too high

Why bother?

Job can take advantage of backfill

Before maintenance, maximum run-time is sometimes shortened
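If you notice after submission that your limit is far too generous, you can lower it yourself for a pending or running job (raising it generally needs an administrator); scontrol update is a standard Slurm command and the values here are only examples:

$ scontrol update JobId=123456 TimeLimit=08:00:00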


After Running Jobs Tidying Up

Data Management

Do

delete files you no longer need

archive files that you don't currently need

tar -czf old_simulations.tgz ./old_simulations

rm -rf ./old_simulations

move data off Soroban (see the sketch below)
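A hedged sketch of moving results elsewhere with rsync; the target host and paths are placeholders, not real ZEDAT machine names:

$ rsync -av --progress ~/results/ user@my-workstation.example.org:soroban-results/
$ rm -r ~/results/old_project   # only after verifying the copy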

Don't

write a large volume of data to /home

write a large number of files to /scratch

duplicate data

ignore mails about your data usage


After Running Jobs Tidying Up

Useful Commands

$ du -sh ~ /scratch/loris (this can take a while)

16G /home/loris

141G /scratch/loris
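To see where the space goes, a slightly finer-grained variant using only standard du and sort options:

$ du -sh /scratch/loris/* | sort -h   # per-directory usage, largest last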


After Running Jobs Tidying Up

Comparison of Data Usage

[Plots of disk usage over time, September 2013 to March 2015: /home usage in GB (0 to 10000) and /scratch usage in GB (0 to 100000), plotted by date.]


After Running Jobs Tidying Up

Publications

Writing a paper?

Please acknowledge us in the paper if time on Soroban has contributed to your results

Had something published?

Please send us:

the bibliographical reference

a nice graphic

Why let us put it on our website?

We get a warm fuzzy feeling: our time and the CPU time weren't wasted

The university management gets a warm fuzzy feeling: money for HPC is well spent


After Running Jobs Tidying Up

Leaving the group / university

Account expiry

If your FU account expires we shall

1 warn you and your group leader

2 wait for your FU account to be deleted (around 4 weeks)

3 delete your HPC account and all your data

4 inform your group leader about the deletion
