Post on 13-Dec-2015
Table of Contents
1 Overview ... 5
  1.1 This Guide ... 5
  1.2 WestGrid ... 5
2 Information for Grant Applicants ... 6
3 Information for Prospective Users ... 8
  3.1 Who should consider using WestGrid? ... 8
  3.2 Am I eligible for an account? ... 8
  3.3 How do I get an account? ... 8
  3.4 Is there a charge for using WestGrid? ... 8
  3.5 What hardware facilities does WestGrid have? ... 8
  3.6 What software is available on WestGrid systems? ... 9
  3.7 How much computing power is available to me? ... 9
  3.8 What is the WestGrid computing environment like? ... 9
  3.9 Is parallel programming required? ... 9
  3.10 What experience do I need to use WestGrid? ... 9
4 Quick Start Guide for New Users ... 11
  4.1 Getting Started ... 11
  4.2 Choosing Which System to Use ... 11
  4.3 Setting up Your Computer ... 12
    4.3.1 Terminal client supporting SSH ... 12
    4.3.2 File transfer client supporting scp and sftp ... 12
    4.3.3 X Window display server for graphics ... 13
  4.4 Connecting and Logging In ... 13
  4.5 Working Interactively ... 14
    4.5.1 The UNIX environment ... 14
    4.5.2 File systems ... 15
    4.5.3 Transferring files ... 15
    4.5.4 Editing files ... 16
    4.5.5 Running interactive programs ... 16
    4.5.6 Restrictions on interactive jobs ... 17
  4.6 Software ... 17
    4.6.1 Locating installed software ... 17
    4.6.2 Installing your own software ... 18
    4.6.3 Requesting software installation ... 18
    4.6.4 Software licensing ... 18
  4.7 Programming ... 18
  4.8 Running Batch Jobs ... 19
    4.8.1 The batch environment ... 19
    4.8.2 Batch job scripts ... 19
    4.8.3 Commands for submitting, monitoring and deleting jobs ... 20
  4.9 Post-Processing ... 21
    4.9.1 Managing files ... 21
    4.9.2 Visualization ... 21
  4.10 Usage Guidelines ... 21
    4.10.1 Job limits ... 21
    4.10.2 How much is reasonable? ... 21
    4.10.3 Job priorities and the fairshare policy ... 22
    4.10.4 Resource Allocation Committee ... 22
    4.10.5 Accounting ... 22
5 More information ... 23
  5.1 Getting Help ... 23
  5.2 Local WestGrid Contacts ... 24
1 Overview
1.1 This Guide
The purpose of this guide is to provide an easy and quick way of accessing information about WestGrid for researchers at the University of Manitoba. This information and much more are available at the WestGrid website, www.westgrid.ca. The guide is divided into three sections:
• Information for Grant Applicants: This section is intended for researchers who are considering applying for grants for HPC equipment. It outlines the policies set forth by CFI and Compute Canada.
• Information for Prospective Users: This section answers a set of common questions to help prospective users decide whether they should use WestGrid.
• Quick Start Guide for New Users: This section is intended to help new WestGrid users get started. It contains the basic information needed to start running jobs on the WestGrid systems.
1.2 WestGrid
WestGrid is a consortium of Western Canadian universities and other partners that provides high performance computing resources for Canadian research projects. It encompasses 14 partner institutions in British Columbia, Alberta, Saskatchewan, and Manitoba.
2 Information for Grant Applicants
Since CFI's creation of the National Platforms Fund (NPF) competition, new rules have been created that affect researchers' ability to apply for High Performance Computing (HPC) equipment, including clusters, shared memory multiprocessors, etc. CFI's goal in doing this was to avoid unnecessary replication and inefficient use of resources and to help ensure the effective management of allocated HPC resources. Unlike some other research equipment, HPC equipment is generally easy to share. Remote access is possible, and effective, automated tools for sharing resources (i.e. scheduling systems) are readily available and deployed. Also, given the relatively short lifetime of HPC equipment, idle time is exceptionally wasteful. Since sharing HPC resources is simple in most cases, it makes sense to have multiple researchers share a few large HPC resources (like those provided by WestGrid) rather than support many smaller, dedicated ones. Finally, managing HPC resources effectively is difficult, and this has a direct impact on research effectiveness. Given the level of support available through NSERC and other granting agencies for technical staff, it makes more sense to provide support "centrally" than to attempt to do so for many small research-lab sized HPC systems. Using graduate students to perform system support, as is often done for smaller systems, is not only ineffective but also prevents them from making timely progress on their own research.
As a result, in 2008, CFI introduced certain restrictions on applications for HPC equipment. Quoting, with some minor additions enclosed by "[]", from the Compute Canada website (www.computecanada.org):
"For projects requesting computing equipment in excess of $500,000, the CFI will require a letter from the applicant's institution detailing the reason(s) why the current high performance computing resources offered by Compute Canada [(i.e. the member consortia, including WestGrid)] are unable to meet the needs of the project. The CFI expects this letter to be submitted only after there have been substantial discussions between the applicant and Compute Canada. The CFI will then make a final decision regarding the need for the proposed computing equipment. The CFI may seek expert advice to assist with difficult or ambiguous cases."
Compute Canada also provides a somewhat more detailed document describing the process used in 2008 to deal with non-consortia requests for HPC equipment valued over $500,000. It may be found at:
https://computecanada.org/__groups__/local.researchers/CFI-HPC_proposals_en.pdf
In general, the facilities provided by Compute Canada via the various HPC consortia in Canada are capable of supporting the vast majority of HPC researchers' needs. As such, the success of non-consortia requests for HPC equipment has been low. There are, however, a number of limitations to what WestGrid and the other consortia can provide due, primarily, to the shared nature of the facilities. These might provide the basis for a successful argument for CFI support for non-consortia based HPC infrastructure.
Currently, the consortia use batch-style job submission systems. Since most HPC problems run for days or weeks before producing results, the lack of interactive access rarely matters, and batch scheduling makes much more effective use of the hardware. Only very limited HPC facilities are available for interactive use. If a researcher can make an argument that they need direct, interactive control over their running HPC applications, then this might be seen as a reason for a dedicated system. While direct access to facilities is certainly convenient, it is seldom required. Further, interactive "steering" of long-running HPC applications has personnel implications. A strong, very special argument would be necessary.
Being shared facilities, the systems provided by the consortia must be highly reliable. This provides another possible reason for requesting a dedicated HPC facility. If a researcher is creating their own custom software that would make an HPC system unstable for other users, then a WestGrid-style shared facility would be inappropriate. Making this argument would normally require that the custom software include modifications to the operating system kernel. Otherwise an argument for possible instability would be very weak. Further, such software development would commonly not require a large HPC system and therefore would be unlikely to cost over $500,000.
Shared HPC facilities must also be generally useful to a wide range of HPC users. This means that special purpose architectures (e.g. those based on the use of graphics processing units) are unlikely to be available. A researcher with a strong and clear need for such a special HPC architecture might request a dedicated facility. Given the unique nature of such systems, however, programming them is typically very challenging. As such, any such application would need to show that specialized technical support staff would be available so that the equipment can be used effectively.
In rare circumstances, an argument for a dedicated facility might also be made based on issues related to software availability. This would require that it be impossible, for example due to strict licensing restrictions, to run software on a shared facility. Almost all software in current use, however, provides licensing options for shared facilities.
Finally, any request for dedicated HPC infrastructure, regardless of reason, will need to make a strong argument related to effectiveness of use. It will need to be clear to CFI that human resources will be available to support the equipment and ensure that it is readily available for use.
3 Information for Prospective Users
This section is adapted from www.westgrid.ca/support/prospective_users and answers a set of common questions to help prospective users decide whether to use WestGrid. More information can be obtained by contacting the support staff (see page 23 for contact information) or by exploring the WestGrid website.
3.1 Who should consider using WestGrid?
The WestGrid project is intended to facilitate research that depends on access to computing resources that are beyond the means of the local resources of the individual researcher and to relieve the researcher of the burden of maintaining his or her own machine room. You should consider using WestGrid if your local computing environment presents fundamental barriers to advancement of your projects, due to such factors as limited numbers of machines, limited memory, inadequate disk space etc. In some cases, access to parallel processing to allow faster turnaround of individual jobs or more aggregate memory to enable larger jobs to be completed is the motivation.
3.2 Am I eligible for an account?
WestGrid facilities are designated for Canadian researchers or those collaborating on Canadian research projects. In general, any academic researcher from a Canadian research institution with significant high performance computing requirements to support his or her research may apply for an account on WestGrid. Students and research assistants require sponsorship from a faculty supervisor.
3.3 How do I get an account?
Conditions of use and other details about accounts, including an online application form, are found at:
http://www.westgrid.ca/support/accounts
3.4 Is there a charge for using WestGrid?
There is currently no charge for routine use of the WestGrid facilities. Charges may apply for backup tapes or specialized software.
3.5 What hardware facilities does WestGrid have?
A variety of commodity and high-performance clusters, large shared-memory computers, and specialized storage, visualization and collaboration facilities are available. For detailed and up-to-date information about the various systems and advice on which one to use see:
http://www.westgrid.ca/resources_services
A single account application gives access to all but a few of the more specialized resources, for which a separate request may be required.
3.6 What software is available on WestGrid systems?
Up-to-date lists of installed system software, compilers, mathematical and other libraries, and application software are available on the WestGrid software page:
http://www.westgrid.ca/support/software
3.7 How much computing power is available to me?
The answer is not straightforward, as there are many variables involved, including which WestGrid system is being used, whether you have a small number of large runs or a large number of small runs, whether your project has been assigned special priority by the Resource Allocation Committee, etc. For an active user without unusually large memory or processor requirements, 10-30 processors may be obtained on a fairly regular basis.
3.8 What is the WestGrid computing environment like?
All the WestGrid computers use a UNIX variant or Linux operating system. Work such as job preparation, compilation, testing and debugging may be done interactively, but the majority of the WestGrid computing resources are available only for production batch-oriented computing. A job script to run your program is written using a UNIX shell scripting language and submitted to the batch job handling system, which assigns it to a machine to run.
3.9 Is parallel programming required?
Although some of the WestGrid computers are reserved for parallel computing, there are legitimate reasons to run serial jobs. So, you are welcome to run serial code on those systems where it is permitted. WestGrid support staff can assist you in selecting the most appropriate systems on which to run your jobs and with parallelization of your code.
3.10 What experience do I need to use WestGrid?
Because WestGrid is a production high-performance computing environment, researchers have a certain responsibility to use the systems effectively. You are expected to learn the basics of UNIX file handling, how to transfer files, submit and monitor batch jobs, monitor your disk storage, etc. Many WestGrid users come from a Microsoft Windows background and so are not expected to be UNIX experts. WestGrid support analysts are happy to help you get started and assist you in learning to use the systems more effectively. You should be aware of the memory requirements of your job and be able to estimate such things as how long a job will take and how much disk space it will require. If you are developing code yourself, you are expected to optimize your code through appropriate choice of algorithm, compiler flags, and, in many cases, using optimized numerical libraries. If you are using a discipline-specific package, you are expected to know how to prepare the input files, choose the appropriate options to apply the software to your particular problem, etc. The WestGrid environment is not particularly good for learning how to use software.
4 Quick Start Guide for New Users
This section is adapted from www.westgrid.ca/support/quickstart/new_users and is intended to help new WestGrid users find basic information needed to start using WestGrid. It consists primarily of links to other pages on the WestGrid web site. If you do not have a WestGrid account yet, please read the previous section for prospective users.
4.1 Getting Started
After an application for a WestGrid account has been approved, an email is sent to the new user to direct him or her to some of the key pages on the WestGrid web site. This guide gives a more extensive list. We recommend that you go through all of the topics on this page, exploring the links to more detailed information on those particular subjects that are most relevant to you. It is useful to try things as you go along and to ask the support staff (see page 23 for contact information) if you encounter difficulties.
4.2 Choosing Which System to Use
After a WestGrid account is received, there is typically a flurry of email advising the account holder that his or her account is ready to use on particular WestGrid systems. One of these emails may include comments from the analyst who screened your account application about the special requirements section on the application form. Sometimes these comments include advice on which WestGrid system to use.
There are a number of factors to consider when choosing a system, the most important typically being whether or not your job runs in parallel and the amount of memory required per process. If the job can be run in parallel, an important criterion to use in determining where to run it is whether it can make use of multiple cluster nodes (such as when using MPI) or has to be run on a shared-memory machine (such as when OpenMP is used). Some general guidelines are:
• Small-memory serial jobs or undemanding parallel jobs are typically run on the Glacier and Robson clusters.
• OpenMP-based parallel jobs and large-memory (> 4 GB) serial jobs may be run on the shared-memory architectures (Cortex and Nexus systems, with Cortex being the preferred starting point).
• For MPI-based parallel programs requiring a high-performance interconnect, try the Matrix cluster. You may also want to compare it with the systems available through Cortex.
• Jobs that require access to graphics/visualization hardware and software are typically run on Hydra.
• A commercial license for the Gaussian chemistry software is only available on the Lattice cluster.
In other cases, availability of certain software libraries may dictate the system to use, but it may be possible to work around such issues by installing additional software or substituting one library for another. Up-to-date lists of installed system software, compilers, mathematical and other libraries, and application software are available on the WestGrid software page:
http://www.westgrid.ca/support/software
For detailed and up to date information about the various systems and advice on which one to use see:
http://www.westgrid.ca/resources_services
4.3 Setting up Your Computer
To connect to and work with WestGrid systems you may have to install one or more software packages on your own computer. Although web browser-based tools may become available for accessing WestGrid in the future, especially as grid services are developed, most users will continue to log in and work directly on remote systems for some time to come.
4.3.1 Terminal client supporting SSH
The most important piece of software you will need is a terminal (client) program that supports the secure shell (SSH) protocol for network communications to remote servers. Linux and Mac OS X users can typically use the built-in terminal programs, whereas Microsoft Windows users often install an additional SSH client, such as PuTTY. PuTTY can be obtained from:
http://www.chiark.greenend.org.uk/~sgtatham/putty/
There is an extensive list of SSH clients at:
http://en.wikipedia.org/wiki/Comparison_of_SSH_clients
4.3.2 File transfer client supporting scp and sftp
You will also need software that supports secure transfer of files between your computer and the WestGrid machines. The command line programs scp and sftp can be used from within terminal programs on Linux or Mac OS X computers. On Microsoft Windows platforms, the similar programs pscp and psftp come with PuTTY.
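As a minimal sketch of what such transfers look like from a Linux or Mac OS X terminal, the commands below copy files to and from a WestGrid system. The user name your_username, the host matrix.westgrid.ca (borrowed from the connection examples in this guide) and the file names are placeholders. The run helper is a hypothetical convenience, not a WestGrid tool: it only prints each command unless DRY_RUN=0 is set, so the example can be tried without a network connection.

```shell
#!/bin/bash
# Placeholder account and host; substitute your own values.
REMOTE="your_username@matrix.westgrid.ca"

# Hypothetical helper: print the command by default,
# execute it only when DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Upload a local file to your remote home directory.
run scp input.dat "$REMOTE:"

# Download a results file from the remote home directory
# into the current local directory.
run scp "$REMOTE:results.dat" .

# Copy a whole directory recursively.
run scp -r my_project "$REMOTE:"

# Start an interactive sftp session (put, get, ls and quit
# are typed at its prompt).
run sftp "$REMOTE"
```

On Microsoft Windows, pscp and psftp from PuTTY are used in a similar way from a command prompt.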
4.3.3 X Window display server for graphics
To use graphical programs on WestGrid computers and show the results on your monitor, you will need to run an X Window display server (X server) program on your local computer. You start up such a program and leave it running in the background while using your SSH terminal program. When graphics commands are relayed by your SSH client from the remote WestGrid computer to the X Window display server, it will display the appropriate graphics on your screen. Your keyboard and mouse commands can be relayed in the other direction and passed from your SSH client to the graphics program running on the remote system. The process is called X11 tunnelling or forwarding. For this to work, you should look for an option in the settings or preferences of your SSH client program to turn on X11 tunnelling.
Commercial X Window display servers are available, but most users can get by with free programs. Linux users will find the X Window support already installed with most distributions. Modern versions of Mac OS X ship with a program called X11, which is not installed by default but is on the system disks. One option for Microsoft Windows users is to install Xming. If installing Xming, you should also install the optional font package.
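One quick way to check whether forwarding is working, once you are logged in to a remote system, is to look at the DISPLAY environment variable, which SSH sets on the remote side when X11 forwarding is active. The sketch below assumes a Bourne-style shell such as bash; the function name check_x11 is purely illustrative.

```shell
# Illustrative helper: report whether an X display is configured
# for the current login session.
check_x11() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "DISPLAY is set to $DISPLAY"
    else
        echo "DISPLAY is not set; X11 forwarding is probably not enabled"
        return 1
    fi
}

# Run the check; "|| true" keeps a script going if the check fails.
check_x11 || true
```

If DISPLAY is set, starting a small graphical program such as xclock (where it is installed) is a common further test.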
4.4 Connecting and Logging In
To successfully connect to WestGrid systems, your computer's IP address must be correctly registered in the Domain Name System (DNS). To test whether your IP address is suitable, visit:
http://westgrid.ca/iptest
To connect to a WestGrid system, start your SSH client and specify the host name of the chosen system and your user name in the connection dialogue box or on the SSH command line, depending on what type of SSH program you are using. Each WestGrid machine to which you can connect has an Internet address of the form machine_name.westgrid.ca. So, for example, to connect to Matrix from a command-line SSH program, you could type:
ssh your_username@matrix.westgrid.ca
If your user name on your local system is the same as on WestGrid, you may omit it and simply type:
ssh matrix.westgrid.ca
To start a session with X11 forwarding turned on, one can typically use
ssh -X matrix.westgrid.ca
although from Mac OS X systems, you may have to use
ssh -Y matrix.westgrid.ca
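If you use a command-line OpenSSH client, these options can also be stored in its per-user configuration file so that they need not be retyped. The fragment below is a sketch only: the alias matrix and the user name your_username are placeholders, and it assumes the OpenSSH client found on most Linux and Mac OS X systems.

```
# ~/.ssh/config on your own computer (OpenSSH client)
Host matrix
    HostName matrix.westgrid.ca
    User your_username
    # Equivalent of the -X option
    ForwardX11 yes
    # Uncomment for the equivalent of the -Y option
    # ForwardX11Trusted yes
```

With such an entry in place, typing ssh matrix would connect as your_username with X11 forwarding turned on.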
If you have successfully connected to one of the WestGrid login servers, you will be prompted for a user name and password. The user name is not your full name, nor your email address, but is the 2- to 8-character name that was entered in the "Requested Username" box when you applied for a WestGrid account. The password to use is the one you specified on that same form. The same password is used for all WestGrid systems. For security, it is stored in an encrypted form. Consequently, if you have forgotten your password, WestGrid administrators will not be able to tell you what it is. Also for security reasons, new passwords are not sent via email. Instead, you choose your own new password and enter it on a web form that is validated using a temporary password given to you by telephone. To request a new password, write to support@westgrid.ca and you will be given instructions on whom to telephone.
4.5 Working Interactively
The hardware at most of the WestGrid sites is set up with one or more servers to which users have direct login access, with the main computational clusters being accessed indirectly, by submitting batch job scripts. The batch jobs run non-interactively when the scheduling system is able to find a time slot with the computational resources needed for the job. However, interactive sessions are typically needed to prepare the batch scripts and input files, compile and debug programs, manage data and post-process results. Some guidelines for working interactively are given in this section.
4.5.1 The UNIX environment
Each of the WestGrid computers runs some version of the UNIX (or Linux) operating system. The program that responds to your typed commands and allows you to run other programs is called the UNIX shell. Examples of a UNIX shell are bash and tcsh. It is useful to have some knowledge of the shell and a variety of other command-line programs that you can use to manipulate files. If you are new to UNIX systems, we recommend that you work through one of the many online tutorials that are available, such as the UNIX Tutorial for Beginners provided by the University of Surrey:
http://www.ee.surrey.ac.uk/Teaching/Unix/index.html
The tutorial covers fundamental topics such as creating, renaming and deleting files and directories, listing your files and finding out how much disk space you are using.
The UNIX man command (man for "manual") can be used to get information about other commands. For example, a reference page about the ls command, for listing file names and properties, can be displayed by typing:
man ls
The default environment varies from one WestGrid system to another and also depends on which UNIX shell you selected on your WestGrid account application form. The working environment is partially determined by the commands in one or more startup files that are automatically executed every time you log in. For bash shell users, these files may include .bashrc and .bash_profile. For tcsh users, .login and .cshrc are executed. You can customize your environment by editing these files to change such things as the appearance of the shell prompt (the characters that appear at the start of the line when the shell is waiting for you to type a command) and the command path (a list of directories in which the shell will search for commands). Use caution when modifying these files, as inappropriate changes may prevent you from being able to work on the system.
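For bash users, a customization of this kind might look like the following sketch. The personal $HOME/bin directory and the prompt format are illustrative assumptions, not WestGrid defaults; tcsh users would need the equivalent tcsh syntax in .cshrc instead.

```shell
# Example lines for ~/.bashrc (bash users only).

# Add a personal bin directory to the front of the command path,
# but only if it is not already listed.
case ":$PATH:" in
    *":$HOME/bin:"*) ;;
    *) PATH="$HOME/bin:$PATH" ;;
esac
export PATH

# Show user name, host name and current directory in the prompt.
PS1='\u@\h:\w\$ '
```

It is prudent to keep a copy of the original file (for example, cp ~/.bashrc ~/.bashrc.bak) and to test a change by opening a second login session before closing the first one.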
Please note that binary executable files from Microsoft Windows PCs will not run on the WestGrid systems. In order to work with such programs, you must obtain the source code and recompile for use on UNIX or Linux. Not all programs will have such source code available.
4.5.2 File systems
As on other computer systems, in a UNIX environment there is a file system that provides a hierarchy of directories (called folders on some other systems) for storing files. When you log in, you are working in part of the file system called your home directory. You may create files and subdirectories in your home directory, although on some WestGrid systems there is a quota limiting the amount of space you can use. How you organize your files is up to you, but it might be helpful to create a separate subdirectory for each job that you submit and to have a separate directory for program source code.
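One possible layout along these lines is sketched below. The directory names (src, jobs/job_001 and so on) are arbitrary examples, and a temporary directory stands in for your home directory so that the commands can be tried without touching real files.

```shell
# Stand-in for your home directory; on a real system you would
# normally work under $HOME instead.
base_dir=$(mktemp -d)

# One directory for program source code and one per submitted job.
mkdir -p "$base_dir/src" "$base_dir/jobs/job_001" "$base_dir/jobs/job_002"

# List the resulting directory hierarchy, relative to the base.
layout=$(cd "$base_dir" && find . -type d | sort)
echo "$layout"

# Remove the demonstration directories again.
rm -rf "$base_dir"
```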
When naming files and directories, you will find it easier to navigate the file hierarchy and to reference files in UNIX commands if you do not use spaces in file names. Also, keep in mind that UNIX is case sensitive in most situations, so, for example, Nobel_Prize.exe and nobel_prize.exe refer to different files. Another difference between UNIX and Microsoft Windows environments is that a file suffix, if present, is of no particular significance to the basic UNIX file manipulation commands. So, for example, there is no requirement for executable programs to have an ".exe" suffix.
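These points can be tried out with a few commands. The sketch below works in a temporary directory and assumes a case-sensitive file system, as is usual on Linux; the file names are examples only.

```shell
demo_dir=$(mktemp -d)

# Two names that differ only in case are two different files.
touch "$demo_dir/Nobel_Prize.exe" "$demo_dir/nobel_prize.exe"
file_count=$(ls "$demo_dir" | wc -l)
echo "Files created: $file_count"

# A program needs execute permission, not an ".exe" suffix.
printf '#!/bin/sh\necho "hello from a program with no suffix"\n' > "$demo_dir/hello"
chmod +x "$demo_dir/hello"
hello_output=$("$demo_dir/hello")
echo "$hello_output"

# A name containing a space must be quoted every time it is used.
touch "$demo_dir/my results.txt"
ls -l "$demo_dir/my results.txt"

rm -rf "$demo_dir"
```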
Besides your home directory, on most of the WestGrid systems there are additional places (/tmp, /scratch and /global/scratch among others) where you can store files and from which you can run programs. Some file systems have more space than others. Sometimes there are performance reasons for choosing one location vs. another. There may also be different usage policies (how long you can keep files and how big they can be) for the various file systems.
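To see which of these locations exist on the system you are logged in to, and how much space each has, something like the following sketch can be used. The function name report_space is illustrative; df is the standard UNIX command for reporting file system space.

```shell
# Illustrative helper: show free space for a directory if it exists.
report_space() {
    if [ -d "$1" ]; then
        df -h "$1"
    else
        echo "$1 is not available on this system"
    fi
}

# /scratch and /global/scratch exist only on some systems.
for dir in "$HOME" /tmp /scratch /global/scratch; do
    report_space "$dir"
done
```

The related command du -sh followed by a directory name reports the total space used by that directory, which is useful when checking your usage against a quota.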
4.5.3 Transferring files
When just starting out on WestGrid systems, you will likely have source code or data to be transferred from your own computer or one at your own institution. Similar to the requirement for a terminal program supporting SSH (Secure Shell), WestGrid requires that you use file transfer software that supports SCP (Secure Copy) or SFTP (SSH File Transfer Protocol). Most SSH packages come with additional programs to support these secure file transfer methods.
Once you have files on a WestGrid system, you may move them between directories using the UNIX mv command, or to other WestGrid sites using scp or sftp. We also provide a utility called gcp (grid copy) that efficiently transfers files between WestGrid systems.
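As a minimal sketch of command-line transfers with scp, the two helper functions below wrap the most common cases. The account name, host name and directory names are placeholders chosen for illustration; substitute the values for your own account and the system you are using.

```shell
#!/bin/bash
# Hypothetical account and host names; substitute your own.
remote="username@glacier.westgrid.ca"

# Copy one local file to a directory on the remote system.
# -p preserves modification times and permissions.
push_file() {
    scp -p "$1" "$remote:$2"
}

# Copy a remote directory tree back to the current local directory.
# -r copies directories recursively.
pull_dir() {
    scp -rp "$remote:$1" .
}

# Example use (run from your own computer):
#   push_file geometry.in jobs/run001/
#   pull_dir jobs/run001
```

Remote paths that do not begin with "/" are interpreted relative to your home directory on the remote system.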
For long term storage of large files, consider using the WestGrid storage facility.
One thing to be aware of when transferring files is that there are different conventions for the characters that terminate each line in a text file on UNIX/Linux, Microsoft Windows and Macintosh computers. File transfer software typically has a transfer mode in which line-ending conversion is done automatically. For example, in Microsoft Windows-based programs, files that have a .txt suffix would be treated as text files for which conversion would likely be done, but C or Fortran source code files having names ending in .c or .f, respectively, might not be recognized as text. You may have to configure your file transfer software to correctly handle files that you commonly use.
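If a file arrives with Microsoft Windows line endings (carriage return plus line feed), it can also be converted after the transfer. The sketch below uses the standard tr utility on a small sample file; the file names are made up for the example. Note that tr removes every carriage return in the file, not only those at the ends of lines, which is normally what is wanted for text files.

```shell
#!/bin/bash
# Create a small text file with Windows-style (CR LF) line endings.
workdir=$(mktemp -d)
printf 'line one\r\nline two\r\n' > "$workdir/input_dos.txt"

# Count the lines that end in a carriage return.
cr=$(printf '\r')
before=$(grep -c "${cr}\$" "$workdir/input_dos.txt")

# Delete every carriage return to obtain UNIX-style (LF) line endings.
tr -d '\r' < "$workdir/input_dos.txt" > "$workdir/input_unix.txt"
after=$(grep -c "${cr}\$" "$workdir/input_unix.txt")

echo "Lines ending in CR before: $before, after: $after"
```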
4.5.4 Editing files
One choice for creating and editing files (to prepare batch scripts or input for your programs, for example) is to transfer files to your own computer and use a local editor with which you are familiar. However, a better choice for most users is to edit the files directly on the WestGrid system on which they will be used. There are several editors available for you to use, as shown on the software page. Two editors commonly used on UNIX systems are emacs and vi. However, if you are coming from a Microsoft Windows background and have set up your computer with X Windows software, as described above, then you may prefer the nedit editor. This is a graphical editor, with keyboard shortcuts similar to what would be found on PCs. See the next section for comments about running nedit and other interactive programs.
There are also a number of UNIX commands available for looking at the contents of files. For example, to page through an output file, test.pbs.o31416, the more command can be used:
more test.pbs.o31416
4.5.5 Running interactive programs
To run a program on a UNIX system, type the name of the corresponding executable file on the command line at the shell prompt. The UNIX shell searches for the command only in the directories in a list stored in a variable, PATH. You can see this list by typing:
echo $PATH
If you get a "command not found" error, check for a spelling mistake or a letter typed in the wrong case, or confirm that the directory containing the executable file is in your command path. On some WestGrid systems, the current working directory is not part of the default command path. In such a case, you can either change the PATH or type "./" in front of the command, as in:
./my_command
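The two remedies can be tried with the following sketch, in which a temporary directory stands in for a directory containing your own programs and my_command is a throwaway script created only for the demonstration.

```shell
#!/bin/bash
# A temporary directory stands in for a directory of your own programs.
bindir=$(mktemp -d)

# Create a tiny executable script named my_command (a hypothetical name).
printf '#!/bin/bash\necho "my_command ran"\n' > "$bindir/my_command"
chmod +x "$bindir/my_command"

# The directory is not on the command path yet, so the shell
# cannot find the program by name alone.
if ! command -v my_command > /dev/null 2>&1; then
    echo "my_command: not found on PATH"
fi

# Option 1: give an explicit path (from inside the directory,
# this would be ./my_command).
first=$("$bindir/my_command")

# Option 2: add the directory to PATH for the rest of this shell session.
PATH="$bindir:$PATH"
export PATH
second=$(my_command)

echo "$first / $second"
```

A change to PATH made this way lasts only for the current session. To make it permanent, it would normally be placed in a shell startup file.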
Many programs (including UNIX commands) take additional arguments, such as numerical parameters or file names, which are listed on the command line after the name of the executable program. Often the command-line arguments are preceded by a dash. For example, to list the last 40 lines of the file, geometry.in, you could use the UNIX tail command:
tail -40 geometry.in
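The form tail -40 is an older way of writing the option that most implementations still accept; the form standardized by POSIX is tail -n 40. The sketch below shows it alongside the related head and wc commands, using a generated 100-line file as a stand-in for geometry.in.

```shell
#!/bin/bash
# Build a 100-line sample file; geometry.in here is only a stand-in.
workdir=$(mktemp -d)
seq 1 100 > "$workdir/geometry.in"

# Last 40 lines, using the portable -n form of the option.
tail -n 40 "$workdir/geometry.in" > "$workdir/last40.txt"

# First 5 lines, and a count of all lines in the file.
first_lines=$(head -n 5 "$workdir/geometry.in")
total=$(wc -l < "$workdir/geometry.in")
echo "Total lines: $total"
```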
4.5.6 Restrictions on interactive jobs
Since the servers to which you log in are shared by many users, interactive work on those machines should be limited to activities such as editing files, compiling programs or running small, short tests of your program. The memory and number of processors vary among the login servers, so the exact policy on the length and size allowed for test runs varies from machine to machine. On some systems there are special queues with short time limits that are intended for batch jobs for testing and debugging. It is also possible to submit a placeholder batch job to reserve one or more dedicated processors, which may then be used for interactive work, without interfering with other users' jobs.
4.6 Software
4.6.1 Locating installed software
Installed software on WestGrid systems includes the UNIX or Linux operating system and a number of standard utilities that often come with such systems. A number of major commercial and free software packages are also available, as well as compilers and a variety of numerical, graphics and file-manipulation libraries for researchers compiling their own codes. Refer to the main WestGrid software page for details on which packages have been installed on each of the main computational systems:
http://www.westgrid.ca/support/software
The installation directories have not been standardized, so, please refer to the table at the top of the software page for a list of the directories where software is typically installed on each system.
4.6.2 Installing your own software
You are welcome to install software under your home directory (if the software license allows the software to be used on remote machines that are not under your direct control and which may not be at your home institution). If you need to share a software package with other members of your group, a corresponding UNIX group can be created to control access to the software. Write to support@westgrid.ca for details on how to do this.
See the programming section below for information on compiling your code.
4.6.3 Requesting software installation
If a package was installed for testing or at the request of a limited number of researchers, it may not be listed on the software page. So, if there is a package that you need, there is a chance that it has already been installed but not announced. In any case, please write to support@westgrid.ca to ask whether a given software package is available or can be installed.
4.6.4 Software licensing
Although WestGrid has purchased some commercial software, such as the Gaussian chemistry code, there are other packages, such as ABAQUS and MATLAB, that are run on WestGrid systems using licenses provided by WestGrid institutions rather than by WestGrid itself. There are often limitations on such licenses, in terms of where the software may be run and how many simultaneous copies may be used.
4.7 Programming
A general introduction to programming on WestGrid systems is available at:
http://www.westgrid.ca/support/programming
That webpage includes links to such things as parallel programming tutorials and to a series of pages giving examples of using the main compilers on all the WestGrid systems. Lists of all the compilers and the numerical (and other) libraries are available at:
http://www.westgrid.ca/support/software
If you have used non-standard language features in your code, you may need to make some changes in order to get it to run on WestGrid systems. Trying your code with more than one compiler is recommended, as this helps identify non-portable sections of your code that should be improved. Contact support staff (see section 5.1 for contact information) if you would like help with porting, debugging or optimizing your code.
Sometimes researchers have chosen to use WestGrid because they want to increase the size of the problem being studied. Running the code on larger data sets can sometimes uncover performance issues or memory access problems. If the code was previously run only on a 32-bit system, moving to a 64-bit environment may require changes if inappropriate assumptions were made regarding the size of some data types, for example.
Another issue that arises when tackling larger problems is the length of time required for the calculation. Some WestGrid systems have job time limits as short as one day. It is recommended that you design your program to include a checkpoint and restart capability. That is, you should periodically write out enough data so that your program can be restarted, if necessary, by reading in that data. That way you can avoid losing the entire calculation if the program doesn't finish before the job time limit is reached.
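The checkpoint and restart idea can be illustrated with the following minimal sketch, written as a shell script for brevity. The "calculation" is just a running sum, and the file name, step counts and the stop_after argument (which imitates a job being cut off by the time limit) are all assumptions made for the example. A real program would save whatever state it needs in order to resume, and would usually write a checkpoint every so many iterations rather than after every step.

```shell
#!/bin/bash
# Minimal checkpoint/restart sketch. The "calculation" is a running sum.
checkpoint="$(mktemp -d)/checkpoint.dat"
last_step=10

run_steps() {
    local stop_after=$1      # imitate the job time limit being reached here
    local step=0 total=0

    # Restart: if a checkpoint exists, resume from the saved state.
    if [ -f "$checkpoint" ]; then
        read -r step total < "$checkpoint"
        echo "Restarting from step $step"
    fi

    while [ "$step" -lt "$last_step" ] && [ "$step" -lt "$stop_after" ]; do
        step=$((step + 1))
        total=$((total + step))
        # Write to a temporary file and then rename it, so that an
        # interruption never leaves a half-written checkpoint.
        echo "$step $total" > "$checkpoint.tmp"
        mv "$checkpoint.tmp" "$checkpoint"
    done
    echo "Stopped at step $step with total $total"
}

run_steps 4     # first job: stops early, as if the time limit was reached
run_steps 10    # second job: picks up at step 4 and finishes
```

The second call reads the saved step and total instead of starting from zero, so only the remaining steps are repeated.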
4.8 Running Batch Jobs
4.8.1 The batch environment
As mentioned above, the main WestGrid computational clusters are accessed by submitting batch job scripts from a login server. It is usually not necessary (and in some cases not allowed) to log on to the compute nodes directly. The system software that handles your batch job consists of two pieces: a resource manager (TORQUE) and a scheduler (Moab). Documentation for these packages is available through Cluster Resources. However, typical users will not need to study those details.
4.8.2 Batch job scripts
Batch job scripts are UNIX shell scripts (basically text files of commands for the UNIX shell to interpret, similar to what you could execute by typing directly at a keyboard) containing special comment lines that contain TORQUE directives. TORQUE evolved from software called PBS (Portable Batch System). Consequences of that history are that the TORQUE directive lines begin with #PBS, some environment variables contain "PBS" (such as $PBS_O_WORKDIR in the script below) and the script files themselves typically have a .pbs suffix (although that is not required).
There are small but significant differences in the batch job scripts, particularly for parallel jobs, among the various WestGrid systems. Examples for each system, for both serial and parallel jobs, are given on the WestGrid website. So, if you begin working on one WestGrid system and switch to another, refer to the documentation before submitting jobs on the second system.
Here is an example job script, diffuse.pbs, for a serial job on the Glacier cluster, to run a program named diffuse.
#!/bin/bash
#PBS -S /bin/bash

# Script for running serial program, diffuse,
# on glacier

cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"
echo "Starting run at: `date`"
./diffuse
echo "Job finished with exit code $? at: `date`"
4.8.3 Commands for submitting, monitoring and deleting jobs
To submit the script, diffuse.pbs, to the batch job handling system, use the qsub command:
qsub diffuse.pbs
If a job is expected to take longer than the default time limit (typically three hours) or uses more than the default memory, additional arguments may be added to the qsub command line. If diffuse is a parallel program, you also have to specify the number of nodes on which it is to run. For example:
qsub -l walltime=72:00:00,mem=1500mb,nodes=4 diffuse.pbs
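Resource requests can also be written into the job script itself as #PBS -l directive lines, so that they do not have to be retyped on every submission. The following sketch is a variation on the serial diffuse.pbs script with the time and memory values from the qsub example embedded. The fallback to the current directory and the check for the executable are additions made here so that the script can also be tried outside the batch system; they are not part of the WestGrid examples.

```shell
#!/bin/bash
#PBS -S /bin/bash
#PBS -l walltime=72:00:00,mem=1500mb

# Sketch of a serial job script with resource requests embedded as
# TORQUE directives instead of qsub command-line arguments.

# PBS_O_WORKDIR is set by TORQUE; fall back to the current directory
# so the script can also be tried outside the batch system.
cd "${PBS_O_WORKDIR:-$PWD}"
echo "Current working directory is $(pwd)"
echo "Starting run at: $(date)"

if [ -x ./diffuse ]; then
    ./diffuse
    status=$?
else
    echo "diffuse executable not found in $(pwd)"
    status=127
fi
echo "Job finished with exit code $status at: $(date)"
```

Options given on the qsub command line take precedence over directives in the script, so the embedded values act as defaults.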
Please see the Running Jobs pages or QuickStart guides for the individual WestGrid systems for more information about the walltime, memory and node limits for specific machines.
When qsub processes the job, it assigns it a job ID and places the job in a queue to await execution. To check on the status of all the jobs on the system, type:
showq
To limit the listing to show just the jobs associated with your user name, type:
showq -u username
To delete a job, use the qdel command with the jobid assigned from qsub:
qdel jobid
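Because qsub prints the job ID when it accepts a job, the ID can be captured in a shell variable and reused in later commands. The helper function below is a small sketch of that idea; the function name is made up for the example.

```shell
#!/bin/bash
# Submit a job script and print the job ID that qsub reports.
submit_job() {
    local script=$1
    local jobid
    jobid=$(qsub "$script") || return 1
    echo "$jobid"
}

# Example use on a login server (diffuse.pbs as in the text above):
#   jobid=$(submit_job diffuse.pbs)
#   qstat "$jobid"      # show the status of that one job
#   qdel "$jobid"       # delete it if necessary
```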
On some WestGrid systems it is difficult to directly monitor some aspects of a job's progress, so it is a good idea to make sure that your program periodically writes output to a file. You can then check the contents of that file to see how the program is doing. In other cases, such as when you need to confirm how much memory your job is using, you may have to write to support@westgrid.ca to request that an administrator check on the job for you.
4.9 Post-Processing
After having completed some calculations on the WestGrid machines, most researchers will need to post-process some output files.
4.9.1 Managing files
In some cases, after a preliminary examination of the output, there may be a way to reduce the volume of data by extracting key numbers and then discarding some of the output. The UNIX grep utility may be helpful in simple cases. A more elaborate process using shell scripts or other programs may be needed. Once the data have been consolidated, files should be backed up, either by transferring them to your own computer or by using the WestGrid storage facility, as mentioned in the section on transferring files, above. If you have a large number of small files, you should consider combining and compressing them with the tar and gzip programs.
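A simple case of this workflow might look like the following sketch. The output files are generated on the spot, and the "Energy" label is a made-up example of a key quantity; your own output will have different names and contents.

```shell
#!/bin/bash
# Sample output files stand in for real job output.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
for n in 1 2 3; do
    printf 'iteration log\nEnergy = -%s.5\nmore log\n' "$n" > "run$n.out"
done

# Extract only the lines containing the key numbers into one summary file.
# -h leaves the file names out of the extracted lines.
grep -h "Energy" run1.out run2.out run3.out > summary.txt

# Combine the small files into a single archive, then compress it.
tar -cf outputs.tar run1.out run2.out run3.out
gzip outputs.tar            # produces outputs.tar.gz

# List the archive contents to confirm before deleting any originals.
archived=$(gzip -dc outputs.tar.gz | tar -tf - | wc -l)
echo "Files in archive: $archived"
```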
4.9.2 Visualization
For most types of calculation, graphical display of the output can be useful for identifying bugs in programs, helping to interpret the data and summarizing the results for others. WestGrid has hardware and software at one site specially geared toward remote visualization; however, it is possible to use visualization tools on any of the WestGrid systems. Graphical data analysis needs tend to be quite specific, so you are encouraged to discuss your particular project with WestGrid support analysts. In some cases it may be feasible to produce graphs or images in batch mode. In other cases, where more interactivity is required, we may recommend using the WestGrid visualization server or transferring the data back to your own computer for visualization there.
4.10 Usage Guidelines
4.10.1 Job limits
WestGrid comprises a wide range of hardware types, from single-node large shared-memory machines to clusters consisting of many dual-processor small-memory nodes. The maximum time limit allowed, the maximum number of processors that may be requested, the maximum number of jobs that can run simultaneously, etc. have been set by system administrators based on the characteristics of the machines and the role they play in the WestGrid environment. Generally speaking, jobs that request more resources (processors or memory) will have stricter limits than jobs that request fewer.
4.10.2 How much is reasonable?
In general, you may submit as many jobs as you like, because the batch scheduling system will restrict the number that are run at any given time. However, so as not to unnecessarily burden the scheduling system or alarm other users, in most cases you should stage job submission so that you don't have many weeks of work waiting to run. It would be reasonable to submit some tens of jobs, for example, if they last a few days each, or hundreds of jobs if they are only a few hours long. You should plan to monitor your runs regularly.
WestGrid users are also expected to take some responsibility for ensuring that their jobs are running efficiently, through the use of appropriate algorithms and compiler optimization options and linking to optimized libraries when possible. Programs should be tested on small problems before committing to longer runs using more resources. In general, parallel programs run more efficiently on smaller numbers of processors. So, study how the performance of your code depends on the number of processors used and balance the need for quick turnaround of your jobs with overall efficiency (that is, use small numbers of processors unless you have a good reason not to).
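One common way to study this dependence is to time the same problem on several processor counts and compute the speedup, T(1)/T(p), and the parallel efficiency, speedup/p. The sketch below does the arithmetic with awk. The timings are invented for illustration and do not come from any WestGrid system.

```shell
#!/bin/bash
# Hypothetical wall-clock times (seconds) for the same problem on
# different processor counts, one "processors seconds" pair per line.
# The first line must be the single-processor run.
timings="1 1000
2 520
4 280
8 170"

# speedup = T(1) / T(p);  efficiency = speedup / p
report=$(echo "$timings" | awk '
    NR == 1 { t1 = $2 }
    { printf "%d %.2f %.2f\n", $1, t1 / $2, t1 / ($2 * $1) }')

echo "procs speedup efficiency"
echo "$report"
```

With these invented numbers the efficiency falls from 0.96 on 2 processors to 0.74 on 8, which is the kind of trend to weigh against the need for faster turnaround.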
4.10.3 Job priorities and the fairshare policy
Users are usually concerned that their jobs may not be progressing in the queue relative to other users. There are a number of factors that affect the priority of the jobs waiting to run. The basic mechanism for determining the priority is called fairshare, in which target usage amounts are assigned to each project. When considering which jobs to run, the scheduling software takes into account the past history (typically over a time span of a couple of weeks, with more recent usage weighted more heavily) and compares the amount of processing completed to the target. Priorities of the jobs are raised or lowered so as to try to meet the fairshare targets.
4.10.4 Resource Allocation Committee
In spite of the name, everyone's fair share is not the same. There is a mechanism for requesting enhanced priority if a project's needs for computational or storage resources extend beyond the average. Periodically, applications are solicited for awards from the WestGrid Resource Allocation Committee and the National Resource Allocation Committee for this privilege. More information about the resource allocation committees is available at:
http://www.westgrid.ca/support/accounts/rac
4.10.5 Accounting
Project usage statistics are available for viewing by project members by logging on to the WestGrid portal:
http://portal.westgrid.ca
5 More information
5.1 Getting Help
WestGrid has a team of technical analysts available to assist researchers with using the WestGrid resources. The analysts provide a wide range of services to researchers, for example:
• Assistance with getting started with HPC
• Provision of training courses and seminars
• Assistance with code development, debugging, optimization, porting and parallelization
• Assistance with code performance analysis
• Assistance with scientific visualization
• Data management advice
There is a single email address, support@westgrid.ca, which is read by all analysts at all the WestGrid sites. Use this address for questions/assistance related to any of the WestGrid facilities. University of Manitoba researchers and graduate students can also contact their local HPC analyst directly for consultation and help:
Jonatan Aronsson
Office: E2-586 EITC Building, Fort Garry Campus
Email: aronsson@cc.umanitoba.ca
Phone: (204) 474-6912
There are a number of other resources that you can also use to get help with HPC and/or WestGrid:
• WestGrid Website: The WestGrid web site (http://www.westgrid.ca) contains detailed information on running jobs, compiling codes, policies and other topics specific to the WestGrid systems.
• Training Seminars: During the fall and winter, WestGrid offers a series of seminars through video conferencing and, in some cases, by web streaming. Past topics have included an overview of WestGrid facilities, introduction to UNIX, serial and parallel (OpenMP and MPI) programming, submitting jobs and data visualization. See the WestGrid training page (http://www.westgrid.ca/support/training) for the schedule and list of topics in the next seminar series.
• Online Training: There are numerous online tutorials on topics such as basic UNIX commands, shell scripting and parallel programming. Some of these are referenced on the corresponding WestGrid web pages, or you can write to the support list mentioned above for recommendations on material covering specific topics.
5.2 Local WestGrid Contacts
For support-related inquiries, please see section 5.1.
Dr. Byron Southern
WestGrid Principal Investigator
souther@cc.umanitoba.ca

Dr. Peter Graham
Representative to the WestGrid Senior Planning and User Needs Committees
graham@cs.umanitoba.ca

Mr. David Wyatt
WestGrid Technical Site Lead and System Administrator
wyatt@cc.umanitoba.ca

Mr. Jonatan Aronsson
HPC Applications Analyst and Collaboration/Visualization Coordinator
aronsson@cc.umanitoba.ca