
Computer Science and Engineering Department

University Politehnica of Bucharest

The NCIT Cluster Resources User's Guide

Version 4.0

Emil-Ioan Slusanschi

Alexandru Herisanu

Razvan Dobre

2013


© 2013 Editura Paideia

Piata Unirii nr. 1, etaj 5, sector 3

Bucuresti, Romania

tel.: (031)425.34.42

e-mail: [email protected]

www.paideia.ro

www.cadourialese.ro

ISBN 978-973-596-909-7


Contents

1 Acknowledgements and History

2 Introduction
  2.1 The Cluster
  2.2 Software Overview
  2.3 Further Information

3 Hardware
  3.1 Configuration
  3.2 Processor Datasheets
    3.2.1 The Intel Xeon Processors
    3.2.2 AMD Opteron Processors
    3.2.3 IBM Cell Broadband Engine Processors
  3.3 Server Datasheets
    3.3.1 IBM Blade Center H
    3.3.2 HS 21 blade
    3.3.3 HS 22 blade
    3.3.4 LS 22 blade
    3.3.5 QS 22 blade
    3.3.6 Fujitsu Celsius R620
    3.3.7 Fujitsu Esprimo Machines
    3.3.8 IBM eServer xSeries 336
    3.3.9 Fujitsu-SIEMENS PRIMERGY TX200 S3
  3.4 Storage System
  3.5 Network Connections
    3.5.1 Configuring VPN
  3.6 HPC Partner Clusters

4 Operating systems
  4.1 Linux
  4.2 Addressing Modes

5 Environment
  5.1 Login
    5.1.1 X11 Tunneling
    5.1.2 VNC
    5.1.3 FreeNX
    5.1.4 Running a GUI on your Virtual Machine
  5.2 File Management
    5.2.1 Tips and Tricks
    5.2.2 Sharing Files Using Subversion / Trac
  5.3 Module Package
  5.4 Batch System
    5.4.1 Sun Grid Engine
    5.4.2 Easy submit: MPRUN.sh
    5.4.3 Easy development: APPRUN.sh
    5.4.4 Running a custom VM on the NCIT-Cluster

6 The Software Stack
  6.1 Compilers
    6.1.1 General Compiling and Linker hints
    6.1.2 Programming Hints
    6.1.3 GNU Compilers
    6.1.4 GNU Make
    6.1.5 Sun Compilers
    6.1.6 Intel Compilers
    6.1.7 PGI Compiler
  6.2 OpenMPI
  6.3 OpenMP
    6.3.1 What does OpenMP stand for?
    6.3.2 OpenMP Programming Model
    6.3.3 Environment Variables
    6.3.4 Directives format
    6.3.5 The OpenMP Directives
    6.3.6 Examples using OpenMP with C/C++
    6.3.7 Running OpenMP
    6.3.8 OpenMP Debugging - C/C++
    6.3.9 OpenMP Debugging - FORTRAN
  6.4 Debuggers
    6.4.1 Sun Studio Integrated Debugger
    6.4.2 TotalView

7 Parallelization
  7.1 Shared Memory Programming
    7.1.1 Automatic Shared Memory Parallelization of Loops
    7.1.2 GNU Compilers
    7.1.3 Intel Compilers
    7.1.4 PGI Compilers
  7.2 Message Passing with MPI
    7.2.1 OpenMPI
    7.2.2 Intel MPI Implementation
  7.3 Hybrid Parallelization
    7.3.1 Hybrid Parallelization with Intel-MPI

8 Performance / Runtime Analysis Tools
  8.1 Sun Sampling Collector and Performance Analyzer
    8.1.1 Collecting experiment data
    8.1.2 Viewing the experiment results
  8.2 Intel MPI benchmark
    8.2.1 Installing and running IMB
    8.2.2 Submitting a benchmark to a queue
  8.3 Paraver and Extrae
    8.3.1 Local deployment - Installing
    8.3.2 Deployment on NCIT cluster
    8.3.3 Installing
    8.3.4 Checking for Extrae installation
    8.3.5 Visualization with Paraver
    8.3.6 Do it yourself tracing on the NCIT Cluster
    8.3.7 Observations
  8.4 Scalasca
    8.4.1 Installing Scalasca
    8.4.2 Running experiments

9 Application Software and Program Libraries
  9.1 Automatically Tuned Linear Algebra Software (ATLAS)
    9.1.1 Using ATLAS
    9.1.2 Performance
  9.2 MKL - Intel Math Kernel Library
    9.2.1 Using MKL
    9.2.2 Performance
  9.3 ATLAS vs MKL - level 1, 2, 3 functions
  9.4 Scilab
    9.4.1 Source code and compilation
    9.4.2 Using Scilab
    9.4.3 Basic elements of the language
  9.5 Deal II
    9.5.1 Introduction
    9.5.2 Description
    9.5.3 Installation
    9.5.4 Unpacking
    9.5.5 Configuration
    9.5.6 Running Examples


1 Acknowledgements and History

This document concerns the use of the NCIT cluster resources. It was developed starting from the GridInitiative 2008 Summer School, with continuous additions over the years until July 2013. Future versions are expected to follow, as new hardware and software upgrades are made to our computing center.

The authors and coordinators of this series would like to thank the following people who have contributed to the information included in this guide: Sever Apostu, Alexandru Gavrila, Ruxandra Cioromela, Alexandra-Nicoleta Firica, Adrian Lascateu, Cristina Ilie, Catalin-Ionut Fratila, Vlad Spoiala, Alecsandru Patrascu, Diana Ionescu, Dumitrel Loghin, George-Danut Neagoe, Ana Maria Tuleiu, Raluca Silvia Negru, Stefan Nour, and many others.

v.1.0  Jul 2008  Initial release
v.1.1  Nov 2008  Added examples, reformatted LaTeX code
v.2.0  Jul 2009  Added chapters 7, 8 and 9; updated chapters 3, 5 and 6
v.3.0  Jul 2010  Added sections in chapters 5 and 6; updated chapters 1-4
v.3.1  Jul 2012  Updated chapters 1-9
v.4.0  Jul 2013  Added sections 8.3, 8.4, 9.4 and 9.5; updated chapters 1-9

2 Introduction

The National Center for Information Technology (NCIT) of the University Politehnica of Bucharest started back in 2001 with the creation of the CoLaborator laboratory, a Research Base with Multiple Users (R.B.M.U.) for High Performance Computing (HPC), which benefited from funding by a World Bank project. CoLaborator was designed as a path of communication between universities, at a national level, using the national network infrastructure for education and research. In 2006, NCIT's infrastructure was enlarged with the creation of a second, more powerful computing site, more commonly referred to as the NCIT Cluster.

Both sites are used for research and teaching by faculty, PhD students, graduate students and undergraduates alike. Currently, a single sign-on (SSO) scenario is implemented, with the same user credentials valid across all sites of the computing infrastructure offered by our center, using the existing LDAP infrastructure behind the http://curs.cs.pub.ro project.

This document was created to serve as an introduction to the parallel computing paradigm and as a guide to using the cluster's specific resources. The following paragraphs describe the various existing hardware architectures and the operating and programming environments, and point to further (external) information on the given subjects.

2.1 The Cluster

Although the two computing sites are different administrative entities and have different physical locations, the approach we use throughout this document is that of a single cluster with various platforms. This approach is also justified by the planned upgrade to a 10Gb Ethernet link between the two sites.

Given that the cluster was built over a rather long period of time, that new machines are added continuously, and that there is a real need for software testing on multiple platforms, the structure of our computing center is a heterogeneous one in terms of hardware platforms and operating/programming environments. More to the point, there are currently six different computing architectures available on the cluster, namely:

• Intel Xeon Quad 64b

• Intel Xeon Nehalem 64b

• AMD Opteron 64b

• IBM Cell BE EDp 32/64b

• IBM Power7 64b

• NVidia Tesla M2070 32/64b

Not all of these platforms currently have a dedicated frontend; the only frontends available at the moment are the fep.grid.pub.ro and gpu.grid.pub.ro machines (Front End Processors). The machines behind them all run non-interactively, being available only to jobs sent through the frontends.

2.2 Software Overview

The tools used on the NCIT and CoLaborator clusters for software development are Sun Studio, OpenMP and OpenMPI. For debugging we use the TotalView Debugger and Sun Studio, and for profiling and performance analysis the Sun Studio Performance Tools and Intel VTune. The Intel, Sun Studio and GNU compilers were used to compile our tools. The installation of all the needed tools was done using the local repository of our computing center, available online at http://storage.grid.pub.ro.

2.3 Further Information

The latest version of this document will always be kept online at: https://cluster.grid.pub.ro/index.php/home

For any questions or feedback on this document, please feel free to contact us at: https://support.grid.pub.ro/


3 Hardware

This section covers in detail the different hardware architectures available in the cluster.

3.1 Configuration

The following table contains the list of all the nodes available for general use. There are also various machines which are currently used for maintenance purposes; they are not presented in the list below.

Model                   Nodes  Processor Type               Sockets/Cores  Memory                   Hostname
IBM HS21                28     Intel Xeon E5405, 2 GHz      2/8            16 GByte                 quad-wn
IBM HS22                4      Intel Xeon E5630, 2.53 GHz   2/16           32 GByte                 nehalem-wn
IBM LS22                14     AMD Opteron 2435, 2.6 GHz    2/12           16 GByte                 opteron-wn
IBM QS22                4      Cell BE Broadband, 3.2 GHz   2/4            8 GByte                  cell-qs
IBM PS703               8      Power7, 2.4 GHz              2/8            32 GByte                 power-wn
IBM iDataPlex dx360M3   4      NVidia Tesla, 1.15 GHz       2/448          32 GByte + 5 GByte VRAM  dp-wn
Fujitsu Esprimo         66     Intel P4, 3 GHz              1/1            2 GByte                  p4-wn
Fujitsu Celsius         2      Intel Xeon, 3 GHz            2/2            2 GByte                  dual-wn

Additionally, our cluster has multiple partnerships and you can choose to run your code on these remote sites using the same infrastructure and the same user ids. What must be considered is the different architecture of the remote cluster and the delay involved in moving your data through the VPN links involved.

3.2 Processor Datasheets

3.2.1 The Intel Xeon Processors

The Intel Xeon Processor refers to many families of Intel's x86 multiprocessing CPUs for dual-processor (DP) and multi-processor (MP) configurations on a single motherboard, targeted at the non-consumer markets of server and workstation computers, and also at blade servers and embedded systems. The Xeon CPUs generally have more cache than their desktop counterparts, in addition to multiprocessing capabilities.

Our cluster is currently equipped with the Intel Xeon 5000 Processor sequence. Here is a quick list of the processors available in our cluster and their corresponding datasheets:

CPU Name     Version  Speed     L2 Cache  Datasheet
Intel Xeon   E5405    2 GHz     12 MB     click here
Intel Xeon   E5630    2.53 GHz  12 MB     click here
Intel Xeon   X5570    2.93 GHz  12 MB     click here
Intel P4     -        3 GHz     -         -


3.2.2 AMD Opteron Processors

The Six-Core AMD Opteron processor-based servers deliver the performance to handle real-world workloads with good energy efficiency. There is one IBM chassis with 14 Infiniband QDR-connected Opteron blades available; the rest are used in our Hyper-V virtualization platform.

Our cluster is currently equipped with the Six-Core AMD Opteron Processor series. Click on a link to see the corresponding datasheet.

CPU Name      Version  Speed    L2 Cache   Datasheet
AMD Opteron   2435     2.6 GHz  6x512 KB   click here

3.2.3 IBM Cell Broadband Engine Processors

The QS22, based on the IBM PowerXCell 8i multicore processor, offers extraordinary single-precision and double-precision floating-point computing power to accelerate key algorithms such as 3D rendering, compression, encryption, financial algorithms, and seismic processing. Our cluster is currently equipped with four Cell B.E. blades. Click on a link to see the datasheet.

CPU Name            Version  Speed    L2 Cache  Datasheet
IBM PowerXCell 8i   QS22     3.2 GHz  -         CELL Arch / PowerPC Arch

3.3 Server Datasheets

This section presents short descriptions of the server platforms that are used in our computing center.

3.3.1 IBM Blade Center H

There are seven chassis and each can fit 14 blades in 9U. You can find general information about the model here. Currently there are five types of blades installed: Intel-based HS21 and HS22 blades, AMD-based LS22, IBM Cell-based QS22, and IBM Power7-based PS703.

3.3.2 HS 21 blade

There are 32 HS21 blades, of which 28 are used for the batch system and 4 are for development and virtualization projects. Each blade has an Intel Xeon quad-core processor at 2 GHz with 2x6 MB L2 cache, 1333 MHz FSB and 16 GB of memory. Full specifications can be found here. One can access these machines using the ibm-quad.q queue in SunGridEngine and their hostname is dual-wnXX.grid.pub.ro - 172.16.3.X

3.3.3 HS 22 blade

There are 14 HS22 blades, of which 4 are used for the batch system and 10 are dedicated to the Hyper-V virtualization environment. Each blade has two Intel Xeon processors at 2.53 GHz with 12 MB L2 cache, 1333 MHz FSB and 32 GB of memory. Full specifications here.

Also, if required, 17 blades are available in the Hyper-V environment for HPC applications using Microsoft Windows HPCC; the user is responsible for setting up the environment. All HS22 blades have FibreChannel-connected storage and use high-speed FibreChannel disks. These disks are only connected on demand; what one gets by using the batch system is a local disk. One can access these machines using the ibm-nehalem.q queue in SunGridEngine and their hostname is nehalem-wnXX.grid.pub.ro - 172.16.9.X


3.3.4 LS 22 blade

There are 20 LS22 blades, of which 14 are available for batch system use; the rest can be used in the Hyper-V environment. Each blade has an Opteron six-core processor at 2.6 GHz. Full specifications can be found here. One can access these machines using the ibm-opteron.q queue in SunGridEngine and their hostname is opteron-wnXX.grid.pub.ro - 172.16.8.X

3.3.5 QS 22 blade

The Cell-based QS22 blade features two dual-core 3.2 GHz IBM PowerXCell 8i processors, 512 KB of L2 cache per IBM PowerXCell 8i processor, plus 256 KB of local store memory for each eDP SPE. Their memory capacity is 8 GB. They have no local storage, so they boot over the network. For more details, see the QS22 features. One can access these machines using the ibm-cell-qs22.q queue in SunGridEngine and their hostname is cell-qs22-X.grid.pub.ro - 172.16.6.X. One can also connect to these systems using a load-balanced connection at cell.grid.pub.ro (SSH).

3.3.6 Fujitsu Celsius R620

The Fujitsu-SIEMENS Celsius are workstations equipped with two Xeon processors. Because of their high energy consumption, starting in January 2011 they were migrated to the training and pre-production labs. A couple of them are still accessible using the batch system, and they are used to host CUDA-capable graphics cards. One can access these machines using the fs-dual.q queue in SunGridEngine. Their hostname is dual-wnXX.grid.pub.ro - 172.16.3.X

3.3.7 Fujitsu Esprimo Machines

There are currently 60 Fujitsu Esprimo machines, model P5905, available. They each have an Intel Pentium 4 3.0 GHz CPU with 2048 KB L2 cache and 2048 MB of DDR2 main memory (upgradable to a maximum of 4 GB) working at 533 MHz, plus a 250 GB SATA II (300 MB/s) disk. More information can be found here. One can access these machines using the fs-p4.q queue in SunGridEngine. If one has special projects requiring physical and dedicated access to the machines, this is the queue to use. Their corresponding hostname is p4-wnXXX.grid.pub.ro - 172.16.2.X.

3.3.8 IBM eServer xSeries 336

The IBM eServer xSeries 336 servers available at the NCIT Cluster are 1U rack-mountable corporate business servers, each with one Intel Xeon 3.0 GHz processor with Intel Extended Memory 64 Technology and upgrade possibility, the Intel E7520 chipset and a data bus speed of 800 MHz. They are equipped with 512 MB DDR2 SDRAM ECC main memory working at 400 MHz (upgradable to a maximum of 16 GB), one Ultra320 SCSI integrated controller and one UltraATA 100 integrated IDE controller. They possess two network interfaces, Ethernet 10Base-T/100Base-TX/1000Base-T (RJ-45). More information on the IBM eServer xSeries 336 can be found on IBM's support site, here. Currently, these servers are part of the core system of the NCIT cluster and users do not have direct access to them.

3.3.9 Fujitsu-SIEMENS PRIMERGY TX200 S3

The Fujitsu-SIEMENS PRIMERGY TX200 S3 servers available at the NCIT Cluster have two Intel Dual-Core Xeon 3.0 GHz processors, each with Intel Extended Memory 64 Technology and upgrade possibility, the Intel 5000V chipset and a data bus speed of 1066 MHz. These processors have 4096 KB of L2 cache, ECC.

They come with 1024 MB DDR2 SDRAM ECC main memory, upgradable to a maximum of 16 GB, 2-way interleaved, working at 400 MHz, one 8-port SAS variant controller, one Fast-IDE controller and a 6-port controller. They have two network interfaces, Ethernet 10Base-T/100Base-TX/1000Base-T (RJ-45). More information on the Fujitsu-SIEMENS PRIMERGY TX200 S3 can be found on Fujitsu-SIEMENS's site, here. Currently, these servers are part of the core system of the NCIT cluster and you do not have direct access to them.

3.4 Storage System

The storage system is composed of the following Dell solutions: 2 PowerEdge 2900 and 2 PowerEdge 2950 servers, and 4 PowerVault MD1000 storage arrays. There are four types of disk systems you can use: local disks, NFS, LustreFS and FibreChannel disks.

All home directories are NFS mounted. There are several reasons behind this approach: first, many profiling tools cannot run over LustreFS because of its locking mechanism, and second, if the cluster is shut down, the time to start the Lustre filesystem is much greater than starting NFS. The NFS partition is under /export/home/ncit-cluster. Jobs with high I/O are forbidden on the NFS directories.

Each user also has access to a LustreFS directory, e.g. ~alexandru.herisanu/LustreFS (a symbolic link to /d02/home/ncit-cluster/prof/alexandru.herisanu). The AMD Opteron nodes (LS22 blades) are connected to the LustreFS servers through Infiniband; all the other nodes use one of the 4 LNET routers to mount the filesystem. There are currently 3 OST servers and 1 MDS node, available either on Infiniband or TCP. Last but not least, each job has a default local scratch space created by our batch system.

Type        Where                                 Observations
NFS         ~HOME (/export/home/ncit-cluster)     Do not run I/O jobs here
LustreFS    ~HOME/LustreFS (/d02/home)
Local HDD   /scratch/tmp                          Local on each node

Starting from December 2011, our virtualization platform has received an upgrade in the form of a new FibreChannel storage system - an IBM DS 3950 machine with a total capacity of 12 TB. The computing center has additional licences available, so if required by really intensive I/O applications where LustreFS is not an option, we can map some hard disks to one of the Nehalem nodes to satisfy these requirements. The NFS server is storage-2. Here follows a list of the IPs of the storage servers.


Hostname    Connected Switch        Port    IP
batch       NCitSw10GW4948-48-1     Gi1/37  172.16.1.1
            NCitSw10GW4948-48-1     Gi1/38  141.85.224.101
            NCitSw10GW4948-48-1     Gi1/39  N/A*
            Infiniband Voltaire     -       N/A*
storage     NCitSw10GW4948-48-1     Gi1/2   141.85.224.10
            NCitMgmtSw-2950-48-1    Fa0/4   172.16.1.10
storage-2   NCitSw10GW4948-48-1     Gi1/13  172.16.1.20
            NCitSw10GW4948-48-1     Gi1/14  141.85.224.103
            Infiniband Voltaire     -       192.168.5.20
storage-3   NCitSw10GW4948-48-1     Gi1/15  172.16.1.30
            Infiniband Voltaire     -       192.168.5.30
storage-4   NCitSw10GW4948-48-1     Gi1/3   172.16.1.40
            Infiniband Voltaire     -       192.168.5.40
storage-5   NCitSw10GW4948-48-1     Gi1/4   172.16.1.60
            NCitSw10GW4948-48-1     Gi1/7   141.85.224.49
            Infiniband Voltaire     -       192.168.5.60

(*) This is the MDS for the Lustre system.

3.5 Network Connections

Our main worker node router is a Debian-based machine (141.85.241.163, 172.16.1.7, 192.168.6.1, 10.42.0.1). It also acts as a name-caching server. If you get a public IP directly, your routers are 141.85.241.1 and 141.85.224.1, depending on the Vlan.

DNS servers: 141.85.241.15, 141.85.164.62

Our IPv6 network is 2001:b30:800:f0::/54. To get the IPv6 address of a host, just use the following rule:

IPv4 address: 172.16.1.7, Vlan 6 (Cluster Nodes) -> IPv6 address: 2001:b30:800:f006:172:16:1:7/64

The prefix ends in f0[06] because f0 is part of the network and 06 because the host is in Vlan 6; 172:16:1:7 is the IPv4 address of the host.
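
Applying the same rule to a hypothetical host with IPv4 address 172.16.3.5 placed in Vlan 3, the mapping would be:

IPv4 address: 172.16.3.5, Vlan 3 -> IPv6 address: 2001:b30:800:f003:172:16:3:5/64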

3.5.1 Configuring VPN

Sometimes one needs access to resources that are not reachable from the internet. If one has an NCIT AD account, then one can still connect through VPN to the training network. Please write to us if you need a VPN account.

The VPN server is win2k3.grid.pub.ro (141.85.224.36). We do not route your traffic, so deselect the following option: right-click on the VPN connection - Properties - Networking - TCP/IP v4 - Properties - Advanced, and deselect Use default gateway on remote network.

3.6 HPC Partner Clusters

One can also run jobs on partner clusters like the HPC ICF and the CNMSI Virtual Cluster. The ICF cluster uses IBM BladeCenter chassis of the same generation as our quad-wn nodes, and the CNMSI cluster uses a lot of virtualized machines with limited memory and limited storage space. Although you can use these systems in any project you like, please take note of the networking and storage architecture involved.


The HPC cluster of the Institute of Physical Chemistry (http://www.icf.ro) is nearly identical to ours. There are 65 HS21 blades available with dual quad-core Xeon processors. The home directories are NFS-mounted through autofs. There are five chassis with 13 blades each. Currently, your home directory is mounted over the VPN link, so it is advisable that you store your data locally using the scratch directory provided by SunGridEngine or a local NFS temporary storage. The use of the icf-hpc-quad.q queue is restricted. The IP range visible from our cluster is 172.17.0.0/16 - quad-wnXX.hpc-icf.ro.

The second cluster is a collaboration between UPB and CNMSI (http://www.cnmsi.ro). They provide us with access to 120 virtual machines, each with 10 GB of hard drive and 1 GB of memory. The use of the cnmsi-virtual.q queue is also restricted. The IP range visible from our cluster is 10.10.60.0/24 - cnmsi-wnXXX.grid.pub.ro.


4 Operating systems

There is only one operating system running in the NCIT Cluster and that is Linux. The cluster is split into an HPC domain and a virtualization domain. If you need to run Windows applications we can provide you with the necessary virtualized machines, documentation and howtos, but you have to set them up yourself.

4.1 Linux

Linux is a UNIX-like operating system. Its name comes from the Linux kernel, originally written in 1991 by Linus Torvalds. The system's utilities and libraries usually come from the GNU operating system, announced in 1983 by Richard Stallman. The Linux release used at the NCIT Cluster is a RHEL (Red Hat Enterprise Linux) clone called Scientific Linux, co-developed by the Fermi National Accelerator Laboratory and the European Organization for Nuclear Research (CERN).

The Linux kernel version we use is:

$ uname -r

2.6.32-279.2.1.el6.x86_64

whereas the distribution release:

$ cat /etc/issue

Scientific Linux release 6.2 (Carbon)

4.2 Addressing Modes

Linux supports 64-bit addressing, thus programs can be compiled and linked either in 32- or 64-bit mode. This has no influence on the capacity or precision of floating point numbers (4 or 8 byte real numbers), affecting only memory addressing, i.e. the usage of 32- or 64-bit pointers. Obviously, programs requiring more than 4 GB of memory have to use the 64-bit addressing mode.
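
As a minimal sketch, assuming the GNU compiler and a hypothetical source file main.c, the addressing mode is selected at compile and link time with the -m32 / -m64 flags:

$ gcc -m64 -o prog64 main.c    (64-bit pointers; required for more than 4 GB of memory)

$ gcc -m32 -o prog32 main.c    (32-bit pointers; 4 GB address space limit)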


5 Environment

5.1 Login

Logging into UNIX-like systems is done through the secure shell (SSH); the SSH daemon is usually installed by default on both Unix and Linux systems. You can log into each one of the cluster's frontends from your local UNIX machine using the ssh command:

$ ssh [email protected]

$ ssh [email protected]

$ ssh [email protected]

Usage example:

$ssh [email protected]

Logging into one of the frontends from Windows is done using PuTTY. We provide three ways to connect to the cluster with a graphical environment: you can use X11 tunneling, VNC or FreeNX to run GUI apps.

5.1.1 X11 Tunneling

The simplest way to get GUI access is to use SSH X11 tunneling. This is also the slowest method.

$ssh -X [email protected]

$xclock

Depending on your local configuration it may be necessary to use the -Y flag to enable trusted forwarding of graphical programs (especially if you are a Mac user). If you are running Windows, you need to run a local X server. A lightweight server (about 2 MB) is Xming.

To connect using Windows, run Xming, run PuTTY and select Connection - SSH - X11 - Enable X11 forwarding, and connect to fep.

5.1.2 VNC

Another method is to use VNC. Because VNC is not encrypted, we use SSH port forwarding just to be safe. The frontend runs a configuration named VNC Linux Terminal Services, meaning that if you connect on port 5900 you will get a VNC server with 1024x768 resolution, 5901 is 800x600 and so on. You cannot connect directly, so you must use ssh.

ssh -L5900:localhost:5900 [email protected]

On the local computer:

vncviewer localhost

The first line connects to fep and creates a tunnel from your host port 5900 to port 5900 on fep (localhost:5900). On your computer, use vncviewer to connect to localhost.

If you use Windows, use the RealVNC viewer and PuTTY. First configure tunneling in PuTTY: run PuTTY and select Connection - SSH - Tunnels. We want to create a tunnel from our machine, port 5900, to fep after we connect, so select Source port: 5900 and Destination: localhost:5900 and click Add. Connect to fep, then use RealVNC to connect to localhost. You should get the remote desktop login screen.


Select the IceWM session from Sessions; there is no GNOME or KDE installed.

5.1.3 FreeNX

FreeNX uses a proprietary protocol over a secondary ssh connection. It is by far the most efficient remote desktop for Linux, but requires a client to be installed. The NX Client is used both on Linux and Windows. After installing the client, run the NX Connection Wizard as in the steps below.

(*) You must actually select Unix - Custom in step 2, as we do not use either GNOME or KDE but IceWM.

A more thorough howto can be found here. Our configuration uses the default FreeNX client keys.

You could also use our NXBuilder app to download, install and configure the connection automatically. Just point your browser to http://cluster.grid.pub.ro/nx and make sure you have Java installed.

5.1.4 Running a GUI on your Virtual Machine

If you wish, you can run your own custom virtual machine on any machine you like, but depending on the virtual domain used, you may not have inbound internet access. You can use ssh tunneling to access your machine; this is how to do it.


First of all, you must decide which of the following methods you want to use: X11 tunneling, VNC or FreeNX. If you use KVM, then you can also connect to the console of the virtual machine directly.

KVM Console

When you run the virtual machine with KVM, you specify the host or the queue where your machine runs. Using qstat or the output of apprun.sh, get the machine that is running the VM, e.g. opteron-wn01.grid.pub.ro, port 11. Connect to the machine that hosts your VM directly using vncviewer, either through X11 tunneling or port forwarding.

a. X11 tunneling

$ssh -X [email protected]

$vncviewer opteron-wn01.grid.pub.ro:11

b. SSH Tunneling (tunnel the vncport on the localhost)

$ssh -L5900:opteron-wn01.grid.pub.ro:5911 [email protected]

on the local host

$vncviewer localhost

You can use any of the methods described earlier to get a GUI on fep (X11, VNC or FreeNX). Use this method if you do not know your IP.

X11 Tunneling / VNC / FreeNX

If you know your IP, just install SSH/VNC or FreeNX on your VM, connect to fep and connect with -X. For example, if the IP of your machine is 10.42.8.1:

a. X11 tunneling

$ ssh -X [email protected]

(fep)$ ssh -X [email protected]

(vm)$ xclock

b. SSH Tunneling (tunnel the remote ssh port on the localhost)

$ ssh -L22000:10.42.8.1:22 [email protected]

on the local host

$ ssh -p 22000 -X root@localhost

All these three methods rely on services you install on your machine. The best way is to forward the remote port locally: in the case of X11 and FreeNX you will tunnel the SSH port (22), in the case of VNC, port 5900+.

For more information check these howtos: http://wiki.centos.org/HowTos/VNC-Server (5. Remote login with vnc-ltsp-config) and http://wiki.centos.org/HowTos/FreeNX.


5.2 File Management

At the time of writing of this section there were no quota limits on how much disk space you can use. If you really need it, we can provide it. Every user of the cluster has a home directory on an NFS shared filesystem within the cluster and a LustreFS mounted directory. Your home directory is usually $HOME=/export/home/ncit-cluster/role/username.

5.2.1 Tips and Tricks

Here are some tips on how to manage your files:

SCP/WinSCP

Transferring files to the cluster from your local UNIX-like machine is done through the secure copy command scp, e.g:

$ scp localfile [email protected]:~/

$ scp -r localdirectory [email protected]:~/

(use -r when transferring multiple files)

The default directory where scp copies the file is the home directory. If you want to specify a different path where to save the file, you should write the path after the ":". For example:

$ scp localfile [email protected]:your/relative/path/to/home

$ scp localfile [email protected]:/your/absolute/path

Transfering files back from the cluster goes the same way:

$ scp [email protected]:path/to/file /path/to/destination/on/local/machine

If you use Windows, use WinSCP. This is an scp client for Windows that provides a graphical file manager for copying files to and from the cluster.

SSH-FS Fuse

This is actually the best method you can use if you plan to edit files locally and run them remotely. First install sshfs (most distributions already have this package). Careful: one must use the absolute path of one's home directory.

$ sshfs [email protected]:/export/home/ncit-cluster/prof/alexandru.herisanu /mnt

This allows you to use, for example, Eclipse locally and see your remote files as local. Because it is a mounted file system, the transfer is transparent (don't forget to unmount). To see what your full home directory path is, do this:

[alexandru.herisanu@fep ~]$ echo $HOME

/export/home/ncit-cluster/prof/alexandru.herisanu
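
When you are done with the mount from the example above, detach it with fusermount (or umount as root):

$ fusermount -u /mnt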


SSH Fish protocol

If you use Linux, you can also use the Fish protocol. Install mc, then press F9 - (Left) - Shell link - fep.grid.pub.ro. You now have all your remote files in the left pane of the file manager.

The same protocol can be used from GNOME and KDE. From GNOME: Places - Connect to Server..., select service type: SSH, Server: fep.grid.pub.ro, Port: 22, Folder: the full path of your home directory (/export...), User Name: your username, Name for connection: NCIT Fep. After doing this you will have a desktop item that Nautilus will use to browse your remote files.

Microsoft Windows Share

Your home directory is also exported as a Samba share. Connect through VPN and browse storage-2. You can also mount your home partition as a local disk. Contact us if you need this feature enabled.

Wget

If you want to download an archive from a web link into your current directory, do this: use Copy Link Location (in your browser) and paste the link as the parameter for wget. For example:

wget http://link/for/download

5.2.2 Sharing Files Using Subversion / Trac

The NCIT Cluster can host your project on its SVN and Trac servers. Trac is an enhanced wiki and issue tracking system for software development projects. Our SVN server is https://svn-batch.grid.pub.ro and the Trac system is at https://ncit-cluster.grid.pub.ro.


Apache Subversion, more commonly known as Subversion (command name svn), is a version control system. It is mostly used in software development projects, where a team of people may alter the same files or folders. Changes are usually identified by a number (sometimes a letter) code, called the "revision number", "revision level", or simply "revision". For example, an initial set of files is "revision 1". When the first change is made, the resulting set is "revision 2", and so on. Each revision is associated with a timestamp and the person making the change. Revisions can be compared, restored, and, with most types of files, merged.

First of all, a repository has to be created in order to host all the revisions. This is generally done using the create command as shown below. Note that any machine can host this type of repository, but in some cases, such as our cluster, you are required to have certain access rights in order to create one.

$ svnadmin create /path/to/repository

Afterwards, the other users are provided with an address which hosts their files. Every user must install an svn version (e.g. subversion-1.6.2) in order to have access to the svn commands and utilities. It is recommended that all the users involved in the same project use the same version.

Before getting started, setting the default editor for the svn log messages is a good idea. Choose whichever editor you see fit; in the example below we chose vim.

$ export SVN_EDITOR=vim

Here are a few basic commands you should master in order to be able to use svn properly and efficiently:

- Import - this command is used only once, when file sources are added for the first time; this step has been previously referred to as adding "revision 1".

$ svn import /path/to/files/on/local/machine /SVN_address/New_directory_for_your_project

- Add - this command is used when you want to add a new file to the ones that already exist. Be careful though - this step by itself does not commit the changes; it must be followed by an explicit commit command.

$ svn add /path/to/new_file_to_add

- Commit - this command is used when adding a new file or when submitting the changes made to one of the files. Before the commit, the changes are only visible to the user who makes them and not to the entire team.

$ svn commit /path/to/file/on/local/machine -m "explicit message explaining your change"

- Checkout - this command is used when you want to retrieve the latest version of the project and bring it to your local machine.

$ svn checkout /SVN_address/Directory_of_your_project /path/to/files/on/local/machine

- rm - this command is used when you want to delete an existing file from your project. This change is visible to all of your team members.

$ svn rm /address/to/file_to_be_deleted -m "message explaining your action"


- merge - this command is used when you want to merge two or more revisions. M and N are the revision numbers you want to merge.

$ svn merge sourceURL1[@N] sourceURL2[@M] [WORKINGPATH]

OR:

$ svn merge -r M:HEAD /SVN_address/project_directory /path/to/files/on/local/machine

The last variant merges the files from revision M up to the latest existing revision.

- Update - this command is used when you want to update the version you have on your local machine to the latest revision. It is also an easy way to merge your files with the changes made by your team before you commit your own changes. Do not worry, your changes will not be lost. If by any chance both you and other members have modified the same lines in a file, a conflict will be signaled and you will be given the opportunity to choose the final version of those lines.

$ svn update
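
As a quick end-to-end sketch (the repository path and file name below are hypothetical), a typical working cycle looks like this:

$ svn checkout https://svn-batch.grid.pub.ro/my_project my_project

$ cd my_project

$ svn update

$ svn add new_file.c

$ svn commit -m "add new_file.c"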

For further information and examples check these links: SVN Redbook and SVN Tutorial.

5.3 Module Package

The Module package provides for the dynamic modification of the user's environment. Initialization scripts can be loaded and unloaded to alter or set shell environment variables such as $PATH or $LD_LIBRARY_PATH, in order to choose, for example, a specific compiler version or use certain software packages.

The advantage of the modules system is that environment changes can easily be undone by unloading a module. Dependencies and conflicts can be easily controlled. If, say, you need MPI with gcc, then you just have to load both the gcc compiler and the mpi-gcc modules. The module files will make all the necessary changes to the environment variables.

Note: the changes will remain active only for your current session. When you exit, they will revert back to the initial settings.

For working with modules, the module command is used. The most important options are explained in the following. To get help about the module command you can either read the manual page (man module), or type

$ module help

To get the list of available modules type

$ module avail

--------------------------- /opt/modules/modulefiles ---------------------------

apps/bullet-2.77 java/jdk1.6.0_23-32bit

apps/codesaturn-2.0.0RC1 java/jdk1.6.0_23-64bit

apps/gaussian03 mpi/Sun-HPC8.2.1c-gnu

apps/gulp-3.4 mpi/Sun-HPC8.2.1c-intel

apps/hrm mpi/Sun-HPC8.2.1c-pgi

apps/matlab mpi/Sun-HPC8.2.1c-sun

batch-system/sge-6.2u5 mpi/intelmpi-3.2.1_mpich

batch-system/sge-6.2u6 mpi/openmpi-1.3.2_gcc-4.1.2


blas/atlas-9.11_gcc mpi/openmpi-1.3.2_gcc-4.4.0

blas/atlas-9.11_sunstudio12.1 mpi/openmpi-1.3.2_pgi-7.0.7

cell/cell-sdk-3.1 mpi/openmpi-1.3.2_sunstudio12.1

compilers/gcc-4.1.2 mpi/openmpi-1.5.0_gcc-4.1.2

compilers/gcc-4.4.0 mpi/openmpi-1.5.1_gcc-4.1.2

compilers/intel-11.0_083 oscar-modules/1.0.5(default)

compilers/pgi-7.0.7 tools/ParaView-3.8.1

compilers/sunstudio12.1 tools/ROOT-5.28.00

debuggers/totalview-8.4.1-7 tools/celestia-1.6.0

debuggers/totalview-8.6.2-2 tools/eclipse_helios-3.6.1

grid/gLite-UI-3.1.31-Prod tools/scalasca-1.3.2_gcc-4.1.2

An available module can be loaded with
$ module load [module name]  ->  $ module load compilers/gcc-4.1.2

A module which has been loaded before but is no longer needed can be removed using
$ module unload [module name]

If you want to use another version of some software (e.g. another compiler), we strongly recommend switching between modules:
$ module switch [oldfile] [newfile]

This will unload all modules from the bottom up to the old file, unload the old file, load the new file and then reload all previously unloaded modules. Due to this procedure the order of the loaded modules is not changed and dependencies are rechecked. Furthermore, some modules adjust their environment variables to match previously loaded modules.

You will get a list of loaded modules with
$ module list

Short information about the software initialized by a module can be obtained by

$ module whatis [file]

e.g.: $ module whatis compilers/gcc-4.1.2

compilers/gcc-4.1.2 : Sets up the GCC 4.1.2 (RedHat 5.3) Environment.

You can add a directory with your own module files with
$ module use path

Note: if you loaded module files in order to compile a program, you probably have to load the same module files before running that program; otherwise some necessary libraries may not be found at program start time. This is also true when using the batch system, as sketched below!
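
For instance, a minimal job script (my_program is a placeholder executable name) would reload the modules used at compile time before starting the program:

#!/bin/bash
# load the same module(s) that were loaded when the program was compiled
module load compilers/gcc-4.1.2
./my_program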

5.4 Batch System

A batch system controls the distribution of tasks (batch jobs) to the available machines or resources. It ensures that the machines are not overbooked, to provide optimal program execution. If no suitable machines have available resources, the batch job is queued and will be executed as soon as resources become available. Compute jobs that are expected to run for a long period of time or use a lot of resources should use the batch system in order to reduce the load on the frontend machines.

You may submit your jobs for execution on one of the available queues. Each of the queues has an associated environment.

To display a summary of the queues:


$ qstat -g c [-q queue]

CLUSTER QUEUE CQLOAD USED RES AVAIL TOTAL aoACDS cdsuE

--------------------------------------------------------------------------------

all.q 0.45 0 0 256 456 200 0

cnmsi-virtual.q -NA- 0 0 0 0 0 0

fs-p4.q -NA- 0 0 0 0 0 0

ibm-cell-qs22.q 0.00 0 0 0 16 0 16

ibm-nehalem.q 0.03 15 0 49 64 0 0

ibm-opteron.q 0.72 122 0 46 168 0 0

ibm-quad.q 0.37 90 0 134 224 0 0

5.4.1 Sun Grid Engine

To submit a job for execution on the cluster, you have two options: either specify the command directly, or provide a script that will be executed. This behavior is controlled by the "-b y|n" parameter as follows: "y" means the command may be a binary or a script, and "n" means it will be treated as a script. Some examples of submitting jobs (both binaries and scripts):

$ qsub -q [queue] -b y [executable] -> $ qsub -q queue_1 -b y /path/my_exec

$ qsub -pe [pe_name] [no_procs] -q [queue] -b n [script]

e.g: $ qsub -pe pe_1 4 -q queue_1 -b n my_script.sh

To watch the evolution of the submitted job, use qstat. Running it without any arguments shows information about the jobs submitted by you alone.

To see the progress of all the jobs use the -f flag. You may also specify which queue's jobs you are interested in by using the -q [queue] parameter, e.g:

$ qstat [-f] [-q queue]

Typing "watch qstat" will automatically run qstat every 2 seconds; to exit, type Ctrl-C. In order to delete a job that was previously submitted, invoke the qdel command, e.g:

$ qdel [-f] [-u user_list] [job_range_list]

where:
-f forces the action for running jobs
-u selects the users whose jobs will be removed. To delete all the jobs of all users, use -u "*".

An example of submitting a job with SGE looks like this:

$ cat script.sh

#!/bin/bash

# run your executable from the job's working directory (my_program is a placeholder)
`pwd`/my_program

$ chmod +x script.sh

$ qsub -q queue_1 script.sh (you may omit -b and it will behave like -b n)

To display the submitted jobs of all users (-u "*") or of a specified user, use:

$ qstat [-q queue] [-u user]

To display extended information about some jobs, use:


$ qstat -t [-u user]

To print detailed information about one job, use:

$ qstat -j job_id

MPI jobs need so-called parallel environments. There are two MPI integration types: tight and loose. Tight integration means that Sun Grid Engine takes care of running all MPI daemons for you on each machine. Loose means that you have to boot up and tear down the MPI ring yourself. The OpenMPI libraries use a tight integration (you use mpirun directly), and the Intel MPI library uses a loose integration (you must use mpdboot.py, mpiexec and mpdallexit). Each configured PE has a different scheduling policy. To see a list of parallel environments, type:

[alexandru.herisanu@fep ~]$ qconf -spl

make

openmpi

openmpi*1

sun-hpc

sun-hpc*1

qsub -q ibm-quad.q -pe openmpi 5 means you want 5 slots from ibm-quad.q. Depending on the type of job you want to run (simple/SMP/MPI/hybrid), you may need to know the scheduling type used on each parallel environment.

Basically, -pe openmpi 5 will schedule all five slots on the same machine, while -pe openmpi*1 5 will schedule each MPI process on a different machine, so you can use OpenMP locally or get better I/O throughput. A minimal job script using the tight OpenMPI integration is sketched below.
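
This is only a sketch: it assumes a compiled MPI executable a.out in the current directory and reuses module names from the module listing above.

#!/bin/bash
# job.sh - reload the toolchain the program was compiled with
module load compilers/gcc-4.1.2
module load mpi/openmpi-1.5.1_gcc-4.1.2
# with the tight OpenMPI integration, mpirun picks up the slot count granted by SGE
mpirun -np $NSLOTS ./a.out

It would be submitted, for example, with:

$ qsub -q ibm-quad.q -pe openmpi 4 -b n job.sh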

You can find a complete howto for Sun Grid Engine here: http://wikis.sun.com/display/gridengine62u5/Home

5.4.2 Easy submit: MPRUN.sh

mprun.sh is a helper script provided by us for easier application profiling. When you test your application, you want to run the same application using different environment settings, compiler settings and so on. The mprun.sh script lets you run and customize your application from the command line.

$ mprun.sh -h

Usage: mprun.sh --job-name [job-name] --queue [queue-name] \

--pe [Paralell Environment Name] [Nr. of Slots] \

--modules [modules to load] --script [My script] \

--out-dir [log dir] --show-qsub --show-script \

--batch-job

Example:

mprun.sh --job-name MpiTest --queue ibm-opteron.q \

--pe openmpi*1 3 \

--modules "compilers/gcc-4.1.2:mpi/openmpi-1.5.1_gcc-4.1.2" \

--script exec_script.sh \

--show-qsub --show-script


=> exec_script.sh <=

# This is where you put what to run ...

mpirun -np $NSLOTS ./a.out

# End of script.

For example, say you have an MPI program named a.out and you wish to test it using different numbers of MPI processes on different queues and with different scheduling options. All you need is a run script like:

mprun.sh --job-name MpiTest --queue ibm-opteron.q --pe openmpi 1 \

--modules "compilers/gcc-4.1.2:mpi/openmpi-1.5.1_gcc-4.1.2" \

--script exec_script.sh --show-qsub --show-script

mprun.sh --job-name MpiTest --queue ibm-nehalem.q --pe openmpi 2 \

--modules "compilers/gcc-4.1.2:mpi/openmpi-1.5.1_gcc-4.1.2" \

--script exec_script.sh --show-qsub --show-script

mprun.sh --job-name MpiTest --queue ibm-quad.q --pe openmpi*1 4 \

--modules "compilers/gcc-4.1.2:mpi/openmpi-1.5.1_gcc-4.1.2" \

--script exec_script.sh --show-qsub --show-script

Using this procedure, you can run your program, modify it, and run it again under the same conditions. A more advanced feature is using different compilers and environment variables: for example, you can use this script to run your program either with the tight OpenMPI integration or with the loose Intel MPI one.

MY_COMPILER="gcc"

mprun.sh --job-name MpiTest --queue ibm-nehalem.q --pe openmpi 2 \

--modules "compilers/gcc-4.1.2:mpi/openmpi-1.5.1_gcc-4.1.2" \

--script exec_script.sh --show-qsub --show-script \

--additional-vars MY_COMPILER

MY_COMPILER="intel"

mprun.sh --job-name MpiTest --queue ibm-quad.q --pe openmpi*1 4 \

--modules "compilers/intel-11.0_083:mpi/intelmpi-3.2.1_mpich" \

--script exec_script.sh --show-qsub --show-script \

--additional-vars MY_COMPILER

Your exec_script.sh must reflect these changes. When you run the execution script, you will also have access to the MY_COMPILER variable.

# This is where you put what to run ...

if [[ $MY_COMPILER == "intel" ]]; then

cat $PE_HOSTFILE | cut -f 1 -d ’ ’ > hostfile

mpdboot --totalnum=$NSLOTS --file=hostfile --rsh=/usr/bin/ssh

mpdtrace -l

mpiexec -np $NSLOTS ./hello_world_intel

mpdallexit

else

mpirun -np $NSLOTS ./a.out

fi

# End of script.


5.4.3 Easy development: APPRUN.sh

Another tool built on top of the SunGridEngine capabilities is apprun. You can use apprun to easily run programs in the batch system and export the display back to you. This is how it works:

You connect to fep.grid.pub.ro using a GUI-capable connection (see section 5.1). Use apprun.sh eclipse, for example, to schedule a job that will run Eclipse on an empty slot. The graphical display will be exported through fep back to you. For example:

$ apprun.sh eclipse

will run Eclipse. Currently available programs are eclipse and xterm - for using interactive jobs.

5.4.4 Running a custom VM on the NCIT-Cluster

The NCIT Cluster has two virtualization strategies available: short-term virtual machines and long-term, Internet-connected VMs. The short-term virtual machines use KVM and the LustreFS storage. They are meant for application scaling in cases where you need another type of operating system or root access. The long-term VM domain is a Hyper-V R2 cluster.

You must be part of the kvm-users group to run a KVM machine. Basically, you use the Sun Grid Engine batch system to reserve resources (CPU and memory). This way the virtual machines will not overlap with normal jobs. All virtual machines are hosted on LustreFS, so you have an InfiniBand connection on the Opteron nodes and a 4x1Gb maximal total throughput if running on the other nodes.

This system is used for system testing and scaling. You boot your machine once, customize it, shut it down and boot several copy-on-write instances back up again. Copy-on-write means the main disks are read-only and all the modified data is written to instance files. If you wish to revert to the initial machine, just delete the instance files and you're set. Additionally, you can run as many instances as you like without having to copy your master machine all over again.

The VM startup script also uses apprun.sh. For example:


##

## Master Copy (for the master copy)

#

#apprun.sh kvm --queue [email protected] --vmname ABDLab --cpu 2 \

#--memory 2048M --hda db2-hda.qcow2 --status status.txt \

#--mac 80:54:00:01:34:01 --vncport 10 --master

##

## Slave Copy (for the copy-on-write instances)

#

apprun.sh kvm --queue [email protected] --vmname ABDLab01 --cpu 2 \

--memory 2048M --hda db2-hda.qcow2 --status status.txt --mac 80:54:01:01:34:01 \

--vncport 11

apprun.sh kvm --queue [email protected] --vmname ABDLab02 --cpu 2 \

--memory 2048M --hda db2-hda.qcow2 --status status.txt --mac 80:54:02:02:34:02 \

--vncport 11

See http://cluster.grid.pub.ro for a complete howto.


6 The Software Stack

This section covers in detail the programming tools available on the cluster, including compilers, debuggers and profiling tools. On the Linux operating system the freely available GNU compilers are the somewhat "natural" choice. Code generated by the gcc C/C++ compiler performs acceptably on the Opteron-based machines. Starting with version 4.2, gcc offers support for shared memory parallelization with OpenMP. Code generated by the old g77 Fortran compiler typically does not perform well. Since version 4 of the GNU compiler suite a Fortran 95 compiler - gfortran - is available. For performance reasons, we recommend that Fortran programmers use the Intel or Sun compiler in 64-bit mode. As there is an almost unlimited number of possible combinations of compilers and libraries, and also the two addressing modes, 32- and 64-bit, we expect that there will be problems with incompatibilities, especially when mixing C++ compilers.

Here’s a shortlist of the software and middleware available on our cluster.

MPI

API: OpenMPI v.1.5.1 (GCC 4.1.2; if needed we can compile it for PGI, Intel and Sun)
Flavor: OpenMPI 1.5
Observations: The default MPI setup. On the Opteron nodes it uses the InfiniBand network by default. All TCP nodes use both Ethernet cards to transmit MPI messages.

API: OpenMPI v.1.5 (GCC 4.1.2)
Flavor: OpenMPI 1.5
Observations: No InfiniBand support compiled.

API: OpenMPI v.1.3 (GCC 4.1.2 and 4.4.0; Intel, PGI and Sun compilers supported)
Flavor: OpenMPI 1.3
Observations: No InfiniBand support compiled.

Environments:

mpi/openmpi-1.3.2-gcc-4.1.2

mpi/openmpi-1.3.2-gcc-4.4.0

mpi/openmpi-1.3.2-pgi-7.0.7

mpi/openmpi-1.3.2-sunstudio12.1

mpi/openmpi-1.5.0-gcc-4.1.2

mpi/openmpi-1.5.1-gcc-4.1.2

One needs to load the compiler before the MPI environment. For example, if you use mpi/openmpi-1.3.2-pgi-7.0.7, then loading a PGI 7.0.7 module first is required.
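For example, a minimal sketch of the required load order (the compiler module name compilers/pgi-7.0.7 is assumed here; check module avail for the exact name on your installation):

$ module avail
$ module load compilers/pgi-7.0.7
$ module load mpi/openmpi-1.3.2-pgi-7.0.7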

Website: http://www.open-mpi.org/
SDK Reference: http://www.open-mpi.org/doc/v1.5/ and http://www.open-mpi.org/doc/v1.3/

API: Sun-HPC 8.2.1c (GCC, Intel, PGI and Sun compilers supported)
Flavor: OpenMPI 1.4
Observations: Sun Cluster Tools 8.2.1c is based on OpenMPI 1.4. InfiniBand support is provided.


Environments: mpi/Sun-HPC8.2.1c-gnu, mpi/Sun-HPC8.2.1c-intel, mpi/Sun-HPC8.2.1c-pgi, mpi/Sun-HPC8.2.1c-sun.

Website: http://www.oracle.com/us/products/tools/message-passing-toolkit-070499.html
SDK Reference: http://download.oracle.com/docs/cd/E19708-01/821-1319-10/index.html

API: Intel MPI 3.2.1
Flavor: MPICH v2
Observations: Loose MPI implementation.

For the Intel Compiler, there is no tight integration setup; you must build your own MPD ring using mpdboot and mpiexec. Sun Grid Engine uses a loose MPI integration for Intel MPI, so you must start the mpd ring manually. For example:

cat $PE_HOSTFILE | cut -f 1 -d ’ ’ > hostfile

mpdboot --totalnum=$NSLOTS --file=hostfile --rsh=/usr/bin/ssh

mpdtrace -l

mpiexec -np $NSLOTS ./hello_world_intel

mpdallexit

When using a parallel environment, Sun Grid Engine exports the file containing the reserved nodes in the variable PE_HOSTFILE. The first line parses that file and rewrites it in MPICH format. The communication is done using ssh public key authentication.

Development software

The current compiler suite is composed of GCC 4.1.2, GCC 4.4.0, Sun Studio 12 U1, Intel 11.0-083, PGI 7.0.7 and PGI 10. Additionally, on the Cell B.E. platform there is an IBM XL compiler for C and Fortran available, but it is currently inaccessible.

The supported Java versions are Sun Java 1.6 U23 and OpenJDK 1.6. We provide Matlab 14 access by running it on VMware machines. The main profiling applications are the Sun Studio Analyzer and Collector and Intel VTune. Other MPI profiling tools available are Scalasca (and its viewer, cube3). Current math libraries available: Intel MKL, NAG (for C and Fortran, currently unavailable) and ATLAS v.9.11 (compiled for gcc and Sun Studio on the Xeon nodes).

Current debugging tools: TotalView 8.6.2, Eclipse Debugger (GDB), Valgrind. We provide access to a remote desktop to use all GUI-enabled applications. We have a dedicated set of computers on which you can run Eclipse, Sun Studio, TotalView etc. and export your display locally. We currently provide remote display capability through FreeNX, VNC or X11 Forwarding.

6.1 Compilers

6.1.1 General Compiling and Linker hints

To access non-default compilers you have to load the appropriate module: use module avail to see the available modules and module load to load the module (see 5.3 Module Package). You can then access the compilers by their original name, e.g. g++, gcc, gfortran, or by the environment variables $CXX, $CC or $FC. When, however, loading more than one compiler module, you have to be aware that the environment variables point to the compiler loaded last.
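For instance, a short sketch using a module name listed elsewhere in this guide (any other compiler module works the same way, and the source file name is just an example):

$ module load compilers/gcc-4.1.2
$ $CC -O2 -o hello hello.c        # $CC now points to the gcc loaded above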

For convenient switching between compilers and platforms, we added environment variables for the most important compiler flags. These variables can be used to write a generic makefile which compiles on all our Unix-like platforms:


• FC, CC, CXX - variables containing the appropriate compiler names.

• FLAGS_DEBUG - enable debug information.

• FLAGS_FAST - include the options which usually offer good performance. For many compilers this will be the -fast option.

• FLAGS_FAST_NO_FPOPT - like FLAGS_FAST, but disallow any floating point optimizations which would have an impact on rounding errors.

• FLAGS_ARCH32 - build 32-bit executables or libraries.

• FLAGS_ARCH64 - build 64-bit executables or libraries.

• FLAGS_AUTOPAR - enable autoparallelization, if the compiler supports it.

• FLAGS_OPENMP - enable OpenMP support, if supported by the compiler.

To produce debugging information in the operating system's native format, use the -g option at compile time. In order to be able to mix different compilers, all these variables also exist with the compiler name in the variable name, like GCC_CXX or FLAGS_GCC_FAST.

$ $CXX $FLAGS_FAST $FLAGS_ARCH64 $FLAGS_OPENMP $PSRC/cpop/pi.cpp

In general we recommend using the same flags for compiling and for linking; otherwise the program may not run correctly or linking may fail. The order of the command line options while compiling and linking does matter. If you get unresolved symbols while linking, this may be caused by a wrong order of libraries. If a library xxx uses symbols from the library yyy, the library yyy has to be to the right of xxx on the command line, e.g. ld ... -lxxx -lyyy.

The search path for header files is extended with the -Idirectory option and the library search path with the -Ldirectory option. The environment variable LD_LIBRARY_PATH specifies the search path where the program loader looks for shared libraries. Some compile-time linkers, e.g. the Sun linker, also use this variable while linking, while the GNU linker does not.
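As a small illustration (libfoo and libbar are placeholder library names; since libfoo uses symbols from libbar, -lfoo must appear to the left of -lbar):

$ $CC -I$HOME/include -c prog.c
$ $CC -o prog prog.o -L$HOME/lib -lfoo -lbar
$ export LD_LIBRARY_PATH=$HOME/lib:$LD_LIBRARY_PATH   # let the loader find the shared libraries at run time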

6.1.2 Programming Hints

Generally, when developing a program, one wants to make it run faster. In order to improve the quality of the code, there are certain aspects that should be followed in order to make better use of the available hardware resources:

1. Turn on compiler optimization. The $FLAGS_FAST options may be a good starting point. However, keep in mind that optimization may change the rounding errors of floating point calculations. You may want to use the variables supplied by the compiler modules. An optimized program typically runs 3 to 10 times faster than the non-optimized one.

2. Try another compiler. The ability of different compilers to generate efficient executables varies. The runtime differences are often between 10% and 30%.

3. Write efficient code, which can be optimized by the compiler. Look up information about the compiler you want to use in its documentation, both online and offline.

4. Access memory contiguously in order to reduce cache and TLB misses. This especially affects multidimensional arrays and complex data structures.

5. Use optimized libraries, e.g. the Sun Performance Library or the ACML library.

6. Use a profiling tool, like the Sun Collector and Analyzer, to find the compute- or time-intensive parts of your program, since these are the parts where you want to start optimizing.

7. Consider parallelization to reduce the runtime of your program.


6.1.3 GNU Compilers

The GNU C/C++/Fortran compilers are available by using the binaries gcc, g++, g77 and gfortran or through the environment variables $CXX, $CC or $FC. If you cannot access them you have to load the appropriate module file, as described in section 5.3 Module Package. For further reference the manual pages are available. Some options to compile your program and increase its performance are:

• -m32 or -m64, to produce code with 32-bit or 64-bit addressing - as mentioned above, the default is platform dependent

• -march=opteron, to optimize for the Opteron processors (NCIT Cluster)

• -mcpu=ultrasparc, to optimize for the UltraSPARC I/II processors (CoLaborator)

• -O2 or -O3, for different levels of optimization

• -malign-double, for Pentium specific optimization

• -ffast-math, for floating point optimizations

GNU Compilers with versions 4.2 and above support OpenMP by default. Use the -fopenmp flag to enable OpenMP support.
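A typical compile line on the Opteron nodes combining the options listed above (the source and binary names are just examples):

$ gcc -m64 -O3 -march=opteron -ffast-math -fopenmp -o myprog myprog.c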

6.1.4 GNU Make

Make is a tool which automates, and hence makes more efficient, the execution of tasks. In particular, it is used to automate the compilation of programs. In order to obtain an executable from multiple sources, it is inefficient to compile every file each time and link-edit them after that. GNU Make compiles every file separately, and once one of them is changed, only the modified one will be recompiled.

The Make tool uses a configuration file called Makefile. Such a file contains automation rules and commands. Here is a very simple Makefile example which helps clarify the Make syntax.

Makefile example1

all:

gcc -Wall hello.c -o hello

clean:

rm -f hello

For the execution of the example above the following commands are used:

$ make

gcc -Wall hello.c -o hello

$ ./hello

hello world!

The example presented before contains two rules: all and clean. When run, the make command performs the first rule written in the Makefile (in this case all - the name is not particularly important).

The executed command is gcc -Wall hello.c -o hello. The user can choose explicitly which rule will be performed by giving it as a parameter to the make command.


$ make clean

rm -f hello

$ make all

gcc -Wall hello.c -o hello

In the above example, the clean rule is used in order to delete the executable hello, and the make all command to obtain the executable again.

It can be seen that no other arguments are passed to the make command to specify which Makefile will be analyzed. By default, GNU Make searches, in order, for the following files: GNUmakefile, makefile and Makefile, and analyzes them.

The syntax of a rule
Here is the syntax of a rule from a Makefile:

target: prerequisites

<tab> command

* target is, usually, the file which will be obtained by performing the command "command". As we have seen in the previous examples, this can also be a virtual target, meaning that it has no file associated with it.

* prerequisites represents the dependencies needed to follow the rule. These are usually the various files needed to obtain the target.

* <tab> represents the tab character and it MUST, by all means, be used before specifying the command.

* command - a list of one or more commands which are executed when the target is obtained.

Here is another example of a Makefile:
Makefile example2

all: hello

hello: hello.o

gcc hello.o -o hello

hello.o: hello.c

gcc -Wall -c hello.c

clean:

rm -f *.o hello

Observation: The rule all is performed implicitly.
* all has a hello dependency and executes no commands.
* hello depends on hello.o; it link-edits the file hello.o.
* hello.o has a hello.c dependency; it compiles and assembles the hello.c file.

In order to obtain the executable, the following commands are used:

$ make -f Makefile_example2

gcc -Wall -c hello.c

gcc hello.o -o hello

The use of variables
A Makefile allows the use of variables. Here is an example:
Makefile example3


CC = gcc

CFLAGS = -Wall -g

all: hello

hello: hello.o

$(CC) $^ -o $@

hello.o: hello.c

$(CC) $(CFLAGS) -c $<

.PHONY: clean

clean:

rm -f *.o hello

In the example above, the variables CC and CFLAGS were defined. CC stands for the compiler used, and CFLAGS for the options (flags) used for compiling. In this case, the options used enable warnings (-Wall) and compiling with debugging support (-g). A variable is referenced using the construction $(VAR_NAME). Therefore, $(CC) is replaced with gcc, and $(CFLAGS) is replaced with -Wall -g.

There are also some useful predefined variables:
* $@ expands to the name of the target;
* $^ expands to the list of prerequisites;
* $< expands to the first prerequisite.

Thus, the command $(CC) $^ -o $@ reads as:

gcc hello.o -o hello

and the command $(CC) $(CFLAGS) -c $< reads as:

gcc -Wall -g -c hello.c

The usage of implicit rules
Many times there is no need to specify the command that must be executed, as it can be detected implicitly. This way, in case the following rule is specified:

main.o: main.c

the implicit command $(CC) $(CFLAGS) -c -o $@ $< is used.
Thus, the Makefile example2 shown before can be simplified, using implicit rules, like this:
Makefile example4

CC = gcc

CFLAGS = -Wall -g

all: hello

hello: hello.o

hello.o: hello.c

.PHONY: clean

clean:

rm -f *.o *~ hello

A phony target is one that is not really the name of a file. It is just a name for some commands to be executed when you make an explicit request. There are two reasons to use a phony target: to avoid a conflict with a file of the same name, and to improve performance. If you write a rule whose commands will not create the target file, the commands will be executed every time the target comes up for remaking. Here is an example:


clean:

rm *.o hello

Because the rm command does not create a file named "clean", probably no such file will ever exist. Therefore, the rm command will be executed every time the "make clean" command is run.

The phony target will cease to work if anything ever does create a file named "clean" in that directory. Since it has no dependencies, the file "clean" would inevitably be considered up to date, and its commands would not be executed. To avoid this problem, the explicit declaration of the target as phony, using the special target .PHONY, is recommended.

.PHONY : clean

Once this is done, "make clean" will run the commands regardless of whether there is a file named "clean" or not. Since make knows that phony targets do not name actual files that could be remade from other files, it skips the implicit rule search for phony targets. This is why declaring a target phony is good for performance, even if you are not worried about the actual file existing. Thus, you first write the line that states that clean is a phony target, then you write the rule, like this:

.PHONY: clean

clean:

rm *.o hello

It can be seen that in Makefile example4 implicit rules are used. The Makefile can be simplified even more, like in the example below:

Makefile example5

CC = gcc

CFLAGS = -Wall -g

all: hello

hello: hello.o

.PHONY: clean

clean:

rm -f *.o hello

In the above example, the rule hello.o: hello.c was deleted. Make sees that there is no file hello.o and it looks for the C file from which it can be obtained. In order to do that, it creates an implicit rule and compiles the file hello.c:

$ make -f Makefile.ex5

gcc -Wall -g -c -o hello.o hello.c

gcc hello.o -o hello

Generally, if we have only one source file, there is no need for a Makefile to obtain the desired executable.

$ls

hello.c

$ make hello

cc hello.c -o hello


Here is a complete example of a Makefile using gcc. Gcc can easily be replaced with other compilers; the structure of the Makefile remains the same. Using all the facilities discussed up to this point, we can write a complete example using gcc (the most commonly used compiler), in order to obtain the executables of both a client and a server.

Files used:
* the server executable depends on the C files server.c, sock.c, cli_handler.c, log.c and on the header files sock.h, cli_handler.h, log.h;
* the client executable depends on the C files client.c, sock.c, user.c, log.c and on the header files sock.h, user.h, log.h.

The structure of the Makefile is presented below:
Makefile example6

CC = gcc # the used compiler

CFLAGS = -Wall -g # the compiling options

LDLIBS = -lefence # the linking options

#create the client and server executables

all: client server

#link the modules client.o user.o sock.o in the client executable

client: client.o user.o sock.o log.o

#link the modules server.o cli_handler.o sock.o in the server executable

server: server.o cli_handler.o sock.o log.o

#compile the file client.c in the object module client.o

client.o: client.c sock.h user.h log.h

#compile the file user.c in the object module user.o

user.o: user.c user.h

# compile the file sock.c in the module object sock.o

sock.o: sock.c sock.h

#compiles the file server.c in the object module server.o

server.o: server.c cli_handler.h sock.h log.h

#compile the file cli_handler.c in the object module cli_handler.o

cli_handler.o: cli_handler.c cli_handler.h

#compile the file log.c in the object module log.o

log.o: log.c log.h

.PHONY: clean

clean:

rm -fr *.o server client

6.1.5 Sun Compilers

We use Sun Studio 6 on the Solaris machines (soon to be upgraded) and Sun Studio 12 on the Linux machines. Nevertheless, the use of these two versions of Sun Studio is pretty much the same.

The Sun Studio development tools include the Fortran 95, C and C++ compilers. To keep your applications free of bugs and performing at their best, we recommend recompiling your code with the latest production compiler. To check the version you are currently using, use the -V flag.

The commands that invoke the compilers are cc, f77, f90, f95 and CC. An important aspect regarding the Fortran 77 compiler is that, starting with Sun Studio 7, it is no longer available.


Actually, the f77 command invokes a wrapper script which is used to pass the necessary compatibility options, like -f77, to the f95 compiler. We recommend adding -f77 -trap=common in order to revert to the f95 settings for error trapping. At the link step you may want to use the -xlang=f77 option (when linking to old f77 object binaries). Detailed information about compatibility issues between Fortran 77 and Fortran 95 can be found at http://docs.sun.com/source/816-2457/5_f77.html
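For example, a sketch of compiling legacy Fortran 77 code with the f95-based toolchain (file names are illustrative):

$ f95 -f77 -trap=common -c old_code.f
$ f95 -xlang=f77 -o prog main.f95 old_code.o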

For more information about the use of the Sun Studio compilers you may use the man pages, but you may also use the documentation found at http://developers.sun.com/sunstudio/documentation.

6.1.6 Intel Compilers

Use the module command to load the Intel compilers into your environment. The current version of the Intel Compiler is 11.1. The Intel C/C++ and Fortran 77/Fortran 90 compilers are invoked by the commands icc, icpc and ifort on Linux. The corresponding manual pages are available for further information. Some options to increase the performance of the produced code include:

• -O3 high optimization

• -fp-model fast=2 enable floating point optimization

• -openmp turn on OpenMP

• -parallel turn on auto-parallelization

In order to read or write big-endian binary data in Fortran programs, you can use the compiler option -convert big_endian.
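Putting the options above together (the source file names are just examples):

$ icc -O3 -fp-model fast=2 -openmp -o cprog cprog.c
$ ifort -O3 -convert big_endian -o fprog fprog.f90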

6.1.7 PGI Compiler

PGI compilers are a set of commercially available Fortran, C and C++ compilers for High Performance Computing systems from The Portland Group. A sample invocation combining some of these options is given after the flag lists below.

PGI Compiler:

• PGF95 - for Fortran

• PGCC - for C

• PGC++ - for C++

PGI Recommended Default Flags:

• -fast A generally optimal set of options including global optimization, SIMD vectorization, loop unrolling and cache optimizations.

• -Mipa=fast,inline Aggressive inter-procedural analysis and optimization, including automatic inlining.

• -Msmartalloc Use optimized memory allocation (Linux only).

• --zc_eh Generate low-overhead exception regions.

PGI Tuning Flags

• -Mconcur Enable auto-parallelization; for use with multi-core or multi-processor targets.


• -mp Enable OpenMP; enable user-inserted parallel programming directives and pragmas.

• -Mprefetch Control generation of prefetch instructions to improve memory performance in compute-intensive loops.

• -Msafeptr Ignore potential data dependencies between C/C++ pointers.

• -Mfprelaxed Relax floating point precision; trade accuracy for speed.

• -tp x64 Create a PGI Unified Binary which functions correctly on and is optimized for both Intel and AMD processors.

• -Mpfi/-Mpfo Profile Feedback Optimization; requires two compilation passes and an interim execution to generate a profile.
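An illustrative PGI compile line combining some of the flags above (the file names are examples):

$ pgcc -fast -Mipa=fast,inline -o cprog cprog.c
$ pgf95 -fast -mp -o fprog fprog.f90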

6.2 OpenMPI

RPMs are available, compiled both for 32- and 64-bit machines. OpenMPI was compiled with both the Sun Studio and the GNU compilers and the user may select which one to use depending on the task.

The compilers provided by OpenMPI are mpicc, mpiCC, mpic++, mpicxx, mpif77 and mpif90. Please note that mpiCC, mpic++ and mpicxx all invoke the same C++ compiler with the same options. Another aspect is that all of these commands are only wrappers that actually call opal_wrapper. Using the -show flag does not invoke the compiler; instead, it prints the command that would be executed. To find out all the possible flags these commands may receive, use the -flags flag.

To compile your program with mpicc, use:

$ mpicc -c pr.c

To link your compiled program, use:

$ mpicc -o pr pr.o

To compile and link all at once, use:

$ mpicc -o pr pr.c

For the other compilers the commands are the same - you only have to replace the compiler's name with the proper one.

The mpirun command executes a program, like

$ mpirun [options] <program> [<args>]

The most used option specifies the number of cores on which to run the job: -n #. It is not necessary to specify the hosts on which the job will execute, because this is managed by Sun Grid Engine.
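For example, to run the program compiled above on 8 cores:

$ mpirun -n 8 ./pr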


6.3 OpenMP

OpenMP is an Application Program Interface (API), jointly defined by a group of major computer hardware and software vendors. OpenMP provides a portable, scalable model for developers of shared memory parallel applications. The API supports C/C++ and Fortran on multiple architectures, including UNIX and Windows NT. This tutorial covers most of the major features of OpenMP, including its various constructs and directives for specifying parallel regions, work sharing, synchronisation and data environment. Runtime library functions and environment variables are also covered. This tutorial includes both C and Fortran example codes and an exercise.

As an Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism, it is comprised of three primary components:

• Compiler Directives

• Runtime Library Routines

• Environment Variables

The API is specified for C/C++ and Fortran and has been implemented on most major platforms, including Unix/Linux platforms and Windows NT, thus making it portable. It is standardised: jointly defined and endorsed by a group of major computer hardware and software vendors, and it is expected to become an ANSI standard.

6.3.1 What does OpenMP stand for?

Short answer: Open Multi-Processing
Long answer: Open specifications for Multi-Processing via collaborative work between interested parties from the hardware and software industry, government and academia.

OpenMP is not meant for distributed memory parallel systems (by itself) and it is not necessarily implemented identically by all vendors. It does not guarantee the most efficient use of shared memory, and it is not required to check for data dependencies, data conflicts, race conditions or deadlocks. It is not required to check for code sequences that cause a program to be classified as non-conforming. It is also not meant to cover compiler-generated automatic parallel processing and directives to the compiler to assist it, and the design does not guarantee that input or output to the same file is synchronous when executed in parallel. The programmer is responsible for the synchronising part.

6.3.2 OpenMP Programming Model

OpenMP is based upon the existence of multiple threads in the shared memory programming paradigm. A shared memory process consists of multiple threads.

OpenMP is an explicit (not automatic) programming model, offering the programmer full control over the parallel processing. OpenMP uses the fork-join model of parallel execution. All OpenMP programs begin as a single process: the master thread. The master thread runs sequentially until the first parallel region construct is encountered.

FORK: the master thread then creates a team of parallel threads. The statements in the program that are enclosed by the parallel region construct are then executed in parallel amongst the various team threads.

JOIN: when the team threads complete, they synchronise and terminate, leaving only the master thread.


Most OpenMP parallelism is specified through the use of compiler directives which are embedded in C/C++ or Fortran source code. Nested parallelism support: the API provides for the placement of parallel constructs inside of other parallel constructs. Implementations may or may not support this feature.

Also, the API provides for dynamically altering the number of threads which may be used to execute different parallel regions. Implementations may or may not support this feature.

OpenMP specifies nothing about parallel I/O. This is particularly important if multiple threads attempt to write to or read from the same file. If every thread conducts I/O to a different file, the issue is not significant. It is entirely up to the programmer to ensure that I/O is conducted correctly within the context of a multi-threaded program.

OpenMP provides a "relaxed-consistency" and "temporary" view of thread memory, as the producers claim. In other words, threads can "cache" their data and are not required to maintain exact consistency with real memory all of the time. When it is critical that all threads view a shared variable identically, the programmer is responsible for ensuring that the variable is FLUSHed by all threads as needed.

6.3.3 Environment Variables

OpenMP provides the following environment variables for controlling the execution of parallel code. All environment variable names are uppercase. The values assigned to them are not case sensitive.

OMP_SCHEDULE
Applies only to the DO and PARALLEL DO (Fortran) and for and parallel for (C/C++) directives which have their schedule clause set to RUNTIME. The value of this variable determines how iterations of the loop are scheduled on processors. For example:

setenv OMP_SCHEDULE "guided, 4"

setenv OMP_SCHEDULE "dynamic"

OMP_NUM_THREADS
Sets the maximum number of threads to use during execution. For example:

setenv OMP_NUM_THREADS 8

OMP_DYNAMIC
Enables or disables dynamic adjustment of the number of threads available for execution of parallel regions. Valid values are TRUE or FALSE. For example:

setenv OMP_DYNAMIC TRUE

OMP_NESTED
Enables or disables nested parallelism. Valid values are TRUE or FALSE. For example:

setenv OMP_NESTED TRUE

Implementation notes: Your implementation may or may not support nested parallelism and/or dynamic threads. If nested parallelism is supported, it is often only nominal, meaning that a nested parallel region may only have one thread. Consult your implementation's documentation for details, or experiment and find out for yourself.

OMP_STACKSIZE
New feature available with OpenMP 3.0. Controls the size of the stack for created (non-master) threads. Examples:


setenv OMP_STACKSIZE 2000500B

setenv OMP_STACKSIZE "3000 k "

setenv OMP_STACKSIZE 10M

setenv OMP_STACKSIZE " 10 M "

setenv OMP_STACKSIZE "20 m "

setenv OMP_STACKSIZE " 1G"

setenv OMP_STACKSIZE 20000

OMP_WAIT_POLICY
New feature available with OpenMP 3.0. Provides a hint to an OpenMP implementation about the desired behaviour of waiting threads. A compliant OpenMP implementation may or may not abide by the setting of the environment variable. Valid values are ACTIVE and PASSIVE. ACTIVE specifies that waiting threads should mostly be active, i.e. consume processor cycles, while waiting. PASSIVE specifies that waiting threads should mostly be passive, i.e. not consume processor cycles, while waiting. The details of the ACTIVE and PASSIVE behaviours are implementation defined. Examples:

setenv OMP_WAIT_POLICY ACTIVE

setenv OMP_WAIT_POLICY active

setenv OMP_WAIT_POLICY PASSIVE

setenv OMP_WAIT_POLICY passive

OMP_MAX_ACTIVE_LEVELS
New feature available with OpenMP 3.0. Controls the maximum number of nested active parallel regions. The value of this environment variable must be a non-negative integer. The behaviour of the program is implementation-defined if the requested value of OMP_MAX_ACTIVE_LEVELS is greater than the maximum number of nested active parallel levels an implementation can support or if the value is not a non-negative integer. Example:

setenv OMP_MAX_ACTIVE_LEVELS 2

OMP_THREAD_LIMIT
New feature available with OpenMP 3.0. Sets the number of OpenMP threads to use for the whole OpenMP program. The value of this environment variable must be a positive integer. The behaviour of the program is implementation-defined if the requested value of OMP_THREAD_LIMIT is greater than the number of threads an implementation can support or if the value is not a positive integer. Example:

setenv OMP_THREAD_LIMIT 8

6.3.4 Directives format

Fortran Directives Format
Format (not case sensitive):

sentinel directive-name [clause ...]

All Fortran OpenMP directives must begin with a sentinel. The accepted sentinels depend on the type of Fortran source. Possible sentinels are:

!$OMP

C$OMP

*$OMP


Example:

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(BETA,PI)

Fixed Form Source:
- !$OMP, C$OMP and *$OMP are accepted sentinels and must start in column 1.
- All Fortran fixed form rules for line length, white space, continuation and comment columns apply for the entire directive line.
- Initial directive lines must have a space/zero in column 6.
- Continuation lines must have a non-space/zero in column 6.

Free Form Source:
- !$OMP is the only accepted sentinel. It can appear in any column, but must be preceded by white space only.
- All Fortran free form rules for line length, white space, continuation and comment columns apply for the entire directive line.
- Initial directive lines must have a space after the sentinel.
- Continuation lines must have an ampersand as the last non-blank character in a line. The following line must begin with a sentinel and then the continuation directives.

General Rules:
* Comments can not appear on the same line as a directive.
* Only one directive name may be specified per directive.
* Fortran compilers which are OpenMP enabled generally include a command line option which instructs the compiler to activate and interpret all OpenMP directives.
* Several Fortran OpenMP directives come in pairs and have the form shown below. The "end" directive is optional but advised for readability.

!$OMP directive

[ structured block of code ]

!$OMP end directive

C / C++ Directives Format
Format:

#pragma omp directive-name [clause, ...] newline

A valid OpenMP directive must appear after the pragma and before any clauses. Clauses can be placed in any order, and repeated as necessary, unless otherwise restricted. It is required that the pragma precedes the structured block which is enclosed by this directive.

Example:

#pragma omp parallel default(shared) private(beta,pi)

General Rules:
* Case sensitive.
* Directives follow conventions of the C/C++ standards for compiler directives.
* Only one directive-name may be specified per directive.
* Each directive applies to at most one succeeding statement, which must be a structured block.
* Long directive lines can be "continued" on succeeding lines by escaping the newline character with a backslash ("\") at the end of a directive line.

PARALLEL Region Construct
Purpose: A parallel region is a block of code that will be executed by multiple threads. This is the fundamental OpenMP parallel construct.

Example:

Fortran


!$OMP PARALLEL [clause ...]

IF (scalar_logical_expression)

PRIVATE (list)

SHARED (list)

DEFAULT (PRIVATE | FIRSTPRIVATE | SHARED | NONE)

FIRSTPRIVATE (list)

REDUCTION (operator: list)

COPYIN (list)

NUM_THREADS (scalar-integer-expression)

block

!$OMP END PARALLEL

C/C++

#pragma omp parallel [clause ...] newline

if (scalar_expression)

private (list)

shared (list)

default (shared | none)

firstprivate (list)

reduction (operator: list)

copyin (list)

num_threads (integer-expression)

structured_block

Notes:
- When a thread reaches a PARALLEL directive, it creates a team of threads and becomes the master of the team. The master is a member of that team and has thread number 0 within that team.
- Starting from the beginning of this parallel region, the code is duplicated and all threads will execute that code.
- There is an implicit barrier at the end of a parallel section. Only the master thread continues execution past this point.
- If any thread terminates within a parallel region, all threads in the team will terminate, and the work done up until that point is undefined.

How Many Threads?
The number of threads in a parallel region is determined by the following factors, in order of precedence:
1. Evaluation of the IF clause
2. Setting of the NUM_THREADS clause
3. Use of the omp_set_num_threads() library function
4. Setting of the OMP_NUM_THREADS environment variable
5. Implementation default - usually the number of CPUs on a node, though it could be dynamic.
Threads are numbered from 0 (master thread) to N-1.

Dynamic Threads:
Use the omp_get_dynamic() library function to determine if dynamic threads are enabled. If supported, the two methods available for enabling dynamic threads are:
1. The omp_set_dynamic() library routine;
2. Setting of the OMP_DYNAMIC environment variable to TRUE.

Nested Parallel Regions:


Use the omp_get_nested() library function to determine if nested parallel regions are enabled. The two methods available for enabling nested parallel regions (if supported) are:
1. The omp_set_nested() library routine
2. Setting of the OMP_NESTED environment variable to TRUE
If not supported, a parallel region nested within another parallel region results in the creation of a new team, consisting of one thread, by default.

Clauses:
IF clause: If present, it must evaluate to .TRUE. (Fortran) or non-zero (C/C++) in order for a team of threads to be created. Otherwise, the region is executed serially by the master thread.

Restrictions:
A parallel region must be a structured block that does not span multiple routines or code files. It is illegal to branch into or out of a parallel region. Only a single IF clause is permitted. Only a single NUM_THREADS clause is permitted.

Example: Parallel Region - Simple "Hello World" program
- Every thread executes all code enclosed in the parallel section
- OpenMP library routines are used to obtain thread identifiers and total number of threads

Fortran - Parallel Region Example

INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM

C Fork a team of threads with each thread having a private TID variable

!$OMP PARALLEL PRIVATE(TID)

C Obtain and print thread id

TID = OMP_GET_THREAD_NUM()

PRINT *, ’Hello World from thread = ’, TID

C Only master thread does this

IF (TID .EQ. 0) THEN

NTHREADS = OMP_GET_NUM_THREADS()

PRINT *, ’Number of threads = ’, NTHREADS

END IF

C All threads join master thread and disband

!$OMP END PARALLEL

END

C / C++ - Parallel Region Example

#include <omp.h>

main () {

int nthreads, tid;

/* Fork a team of threads with each thread having a private tid variable */

#pragma omp parallel private(tid)

{

/* Obtain and print thread id */


tid = omp_get_thread_num();

printf("Hello World from thread = %d\n", tid);

/* Only master thread does this */

if (tid == 0)

{

nthreads = omp_get_num_threads();

printf("Number of threads = %d\n", nthreads);

}

} /* All threads join master thread and terminate */

}

General rules for directives (for more details about these directives see 6.3.5 The OpenMP Directives):

- They follow the standards and conventions of the C/C++ or Fortran compilers;
- They are case sensitive;
- In a directive, only one name can be specified;
- Any directive can be applied only to the statement following it, which must be a structured block;
- "Long" directives can be continued on the next lines by adding a \ at the end of the first line of the directive.

6.3.5 The OpenMP Directives

PARALLEL region: a block will be executed in parallel by OMP_NUM_THREADS threads. It is the fundamental construct in OpenMP.

Work-sharing constructs:
DO/for - shares the iterations of a loop across all threads (data parallelism);
SECTIONS - splits the work into separate sections (functional parallelism);
SINGLE - serialises a code section.

Synchronisation constructs:
MASTER - only the master thread will execute the region of code;
CRITICAL - that region of code will be executed by only one thread at a time;
BARRIER - all threads from the pool synchronise;
ATOMIC - a certain memory location will be updated atomically - a sort of critical section;
FLUSH - identifies a synchronisation point at which the memory must be in a consistent state;
ORDERED - the iterations of the loop under this directive will be executed in the same order as the corresponding serial execution;
THREADPRIVATE - used to make global variables local and persistent to a thread across several parallel regions.

Clauses to set the context:
These are important for programming in a shared memory programming model. They are used together with the PARALLEL, DO/for and SECTIONS directives.
PRIVATE - the variables from the list are private to each thread;
SHARED - the variables from the list are shared by the threads of the current team;
DEFAULT - allows the user to set the default scoping ("PRIVATE", "SHARED" or "NONE") for all the variables in a parallel region;


FIRSTPRIVATE - combines the functionality of the PRIVATE clause with automatic initialization of the variables from the list: the local copies are initialised with the value the original variable had before the construct;

LASTPRIVATE - combines the functionality of the PRIVATE clause with a copy of the value from the last iteration or section back to the original variable;

COPYIN - offers the possibility to assign the same value to the THREADPRIVATE variables for all the threads in the pool;

REDUCTION - performs a reduction on the variables that appear in the list (with a specific operation: + - * /, etc.).

6.3.6 Examples using OpenMP with C/C++

Here are some examples using OpenMP with C/C++:

/******************************************************************************

* OpenMP Example - Hello World - C/C++ Version

* FILE: omp_hello.c

* DESCRIPTION:

* In this simple example, the master thread forks a parallel region.

* All threads in the team obtain their unique thread number and print it.

* The master thread only prints the total number of threads. Two OpenMP

* library routines are used to obtain the number of threads and each

* thread’s number.

* SOURCE: Blaise Barney 5/99

* LAST REVISED:

******************************************************************************/

#include <omp.h>

main () {

int nthreads, tid;

/* Fork a team of threads giving them their own copies of variables */

#pragma omp parallel private(nthreads, tid)

{

/* Obtain thread number */

tid = omp_get_thread_num();

printf("Hello World from thread = %d\n", tid);

/* Only master thread does this */

if (tid == 0)

{

nthreads = omp_get_num_threads();

printf("Number of threads = %d\n", nthreads);

}

} /* All threads join master thread and disband */

}

/******************************************************************************

* OpenMP Example - Loop Work-sharing - C/C++ Version

* FILE: omp_workshare1.c

* DESCRIPTION:

* In this example, the iterations of a loop are scheduled dynamically

* across the team of threads. A thread will perform CHUNK iterations


* at a time before being scheduled for the next CHUNK of work.

* SOURCE: Blaise Barney 5/99

* LAST REVISED: 03/03/2002

******************************************************************************/

#include <omp.h>

#define CHUNKSIZE 10

#define N 100

main () {

int nthreads, tid, i, chunk;

float a[N], b[N], c[N];

/* Some initializations */

for (i=0; i < N; i++)

a[i] = b[i] = i * 1.0;

chunk = CHUNKSIZE;

#pragma omp parallel shared(a,b,c,chunk) private(i,nthreads,tid)

{

tid = omp_get_thread_num();

#pragma omp for schedule(dynamic,chunk)

for (i=0; i < N; i++)

{

c[i] = a[i] + b[i];

printf("tid= %d i= %d c[i]= %f\n", tid,i,c[i]);

}

if (tid == 0)

{

nthreads = omp_get_num_threads();

printf("Number of threads = %d\n", nthreads);

}

} /* end of parallel section */

}

/******************************************************************************

* OpenMP Example - Sections Work-sharing - C/C++ Version

* FILE: omp_workshare2.c

* DESCRIPTION:

* In this example, the iterations of a loop are split into two different

* sections. Each section will be executed by one thread. Extra threads

* will not participate in the sections code.

* SOURCE: Blaise Barney 5/99

* LAST REVISED: 03/03/2002

******************************************************************************/

#include <omp.h>

#define N 50

main ()

{

int i, nthreads, tid;

float a[N], b[N], c[N];

/* Some initializations */

for (i=0; i < N; i++)


a[i] = b[i] = i * 1.0;

#pragma omp parallel shared(a,b,c) private(i,tid,nthreads)

{

tid = omp_get_thread_num();

printf("Thread %d starting...\n",tid);

#pragma omp sections nowait

{

#pragma omp section

for (i=0; i < N/2; i++)

{

c[i] = a[i] + b[i];

printf("tid= %d i= %d c[i]= %f\n",tid,i,c[i]);

}

#pragma omp section

for (i=N/2; i < N; i++)

{

c[i] = a[i] + b[i];

printf("tid= %d i= %d c[i]= %f\n",tid,i,c[i]);

}

} /* end of sections */

if (tid == 0)

{

nthreads = omp_get_num_threads();

printf("Number of threads = %d\n", nthreads);

}

} /* end of parallel section */

}

/******************************************************************************

* OpenMP Example - Combined Parallel Loop Reduction - C/C++ Version

* FILE: omp_reduction.c

* DESCRIPTION:

* This example demonstrates a sum reduction within a combined parallel loop

* construct. Notice that default data element scoping is assumed - there

* are no clauses specifying shared or private variables. OpenMP will

* automatically make loop index variables private within team threads, and

* global variables shared.

* SOURCE: Blaise Barney 5/99

* LAST REVISED:

******************************************************************************/

#include <omp.h>

main () {

int i, n;

float a[100], b[100], sum;

/* Some initializations */

n = 100;

for (i=0; i < n; i++)

a[i] = b[i] = i * 1.0;


sum = 0.0;

#pragma omp parallel for reduction(+:sum)

for (i=0; i < n; i++)

sum = sum + (a[i] * b[i]);

printf(" Sum = %f\n",sum);

}

6.3.7 Running OpenMP

On Linux machines, the GNU C Compiler now provides integrated support for OpenMP. To compile programs that use "#pragma omp" directives, add the -fopenmp flag to the gcc command.

In order to compile on a local station, for a C/C++ program the command used is:

- gcc -fopenmp my_program.c

In order to compile on fep.grid.pub.ro, the following command can be used (gcc):

- gcc -fopenmp -O3 file_name.c -o binary_name

For defining the number of threads use a structure similar to the following one:

#define NUM_THREADS 2

combined with the function omp_set_num_threads(NUM_THREADS). Similarly,

export OMP_NUM_THREADS=4

can be used in the command line in order to create a 4-thread example.
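A minimal sketch combining both approaches (when omp_set_num_threads() is called, it overrides the value exported in OMP_NUM_THREADS for the following parallel regions):

#include <omp.h>
#include <stdio.h>

#define NUM_THREADS 2

int main(void) {
    omp_set_num_threads(NUM_THREADS);   /* request NUM_THREADS threads */
    #pragma omp parallel
    {
        /* each thread prints its id and the team size */
        printf("Thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

Compile it with gcc -fopenmp as shown above.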

6.3.8 OpenMP Debugging - C/C++

In this section, there are some C and Fortran program examples using OpenMP that have bugs. We will show you how to fix these programs, and we will briefly present a debugging tool, called TotalView, that is explained later in the documentation.

/******************************************************************************

* OpenMP Example - Combined Parallel Loop Work-sharing - C/C++ Version

* FILE: omp_workshare3.c

* DESCRIPTION:

* This example attempts to show use of the parallel for construct. However

* it will generate errors at compile time. Try to determine what is causing

* the error. See omp_workshare4.c for a corrected version.

* SOURCE: Blaise Barney 5/99

* LAST REVISED: 03/03/2002

******************************************************************************/

#include <omp.h>

#define N 50

#define CHUNKSIZE 5

main () {

int i, chunk, tid;

float a[N], b[N], c[N];


/* Some initializations */

for (i=0; i < N; i++)

a[i] = b[i] = i * 1.0;

chunk = CHUNKSIZE;

#pragma omp parallel for \

shared(a,b,c,chunk) \

private(i,tid) \

schedule(static,chunk)

{

tid = omp_get_thread_num();

for (i=0; i < N; i++)

{

c[i] = a[i] + b[i];

printf("tid= %d i= %d c[i]= %f\n", tid, i, c[i]);

}

} /* end of parallel for construct */

}

The output of the gcc command is:

[testuser@fep ~]$ gcc -fopenmp test_openmp.c -o opens

test_openmp.c: In function 'main':

test_openmp.c:19: error: for statement expected before '{' token

test_openmp.c:24: warning: incompatible implicit declaration of built-in function 'printf'

The cause of these errors is the form of the code that follows the pragma declaration. It is not allowed to include code between the parallel for directive and the for loop. Also, it is not allowed to enclose the code that follows the pragma declaration in braces (e.g. {}).

The revised, correct form of the program above, is the following:

/******************************************************************************

* OpenMP Example - Combined Parallel Loop Work-sharing - C/C++ Version

* FILE: omp_workshare4.c

* DESCRIPTION:

* This is a corrected version of the omp_workshare3.c example. Corrections

* include removing all statements between the parallel for construct and

* the actual for loop, and introducing logic to preserve the ability to

* query a thread’s id and print it from inside the for loop.

* SOURCE: Blaise Barney 5/99

* LAST REVISED: 03/03/2002

******************************************************************************/

#include <omp.h>

#define N 50

#define CHUNKSIZE 5

main () {

int i, chunk, tid;

float a[N], b[N], c[N];

char first_time;


/* Some initializations */

for (i=0; i < N; i++)

a[i] = b[i] = i * 1.0;

chunk = CHUNKSIZE;

first_time = ’y’;

#pragma omp parallel for \

shared(a,b,c,chunk) \

private(i,tid) \

schedule(static,chunk) \

firstprivate(first_time)

for (i=0; i < N; i++)

{

if (first_time == ’y’)

{

tid = omp_get_thread_num();

first_time = ’n’;

}

c[i] = a[i] + b[i];

printf("tid= %d i= %d c[i]= %f\n", tid, i, c[i]);

}

}

Even though we detected the error above easily, just by considering syntax, things will not work that simply every time. There are errors that cannot be detected at compile time. In such cases, a specialized debugger, called TotalView, is used. More details about these debuggers can be found in the "Debuggers" section.

/******************************************************************************

* FILE: omp_bug2.c

* DESCRIPTION:

* Another OpenMP program with a bug.

******************************************************************************/

#include <omp.h>

main () {

int nthreads, i, tid;

float total;

/*** Spawn parallel region ***/

#pragma omp parallel

{

/* Obtain thread number */

tid = omp_get_thread_num();

/* Only master thread does this */

if (tid == 0) {

nthreads = omp_get_num_threads();

printf("Number of threads = %d\n", nthreads);

}

printf("Thread %d is starting...\n",tid);

#pragma omp barrier

/* do some work */


total = 0.0;

#pragma omp for schedule(dynamic,10)

for (i=0; i<1000000; i++)

total = total + i*1.0;

printf ("Thread %d is done! Total= %f\n",tid,total);

} /*** End of parallel region ***/

}

The bugs in this case are caused by neglecting to scope the TID and TOTAL variables as PRIVATE. By default, most OpenMP variables are scoped as SHARED. These variables need to be unique for each thread. It is also necessary to include stdio.h in order to have no warnings.

The repaired form of the program, is the following:

#include <omp.h>

#include <stdio.h>

main () {

int nthreads, i, tid;

float total;

/*** Spawn parallel region ***/

#pragma omp parallel private(tid,total)

{

/* Obtain thread number */

tid = omp_get_thread_num();

/* Only master thread does this */

if (tid == 0) {

nthreads = omp_get_num_threads();

printf("Number of threads = %d\n", nthreads);

}

printf("Thread %d is starting...\n",tid);

#pragma omp barrier

/* do some work */

total = 0.0;

#pragma omp for schedule(dynamic,10)

for (i=0; i<1000000; i++)

total = total + i*1.0;

printf ("Thread %d is done! Total= %f\n",tid,total);

} /*** End of parallel region ***/

}

/******************************************************************************

* FILE: omp_bug3.c

* DESCRIPTION:

* Run time error

* AUTHOR: Blaise Barney 01/09/04

* LAST REVISED: 06/28/05

******************************************************************************/

#include <omp.h>

#include <stdio.h>


#include <stdlib.h>

#define N 50

int main (int argc, char *argv[])

{

int i, nthreads, tid, section;

float a[N], b[N], c[N];

void print_results(float array[N], int tid, int section);

/* Some initializations */

for (i=0; i<N; i++)

a[i] = b[i] = i * 1.0;

#pragma omp parallel private(c,i,tid,section)

{

tid = omp_get_thread_num();

if (tid == 0)

{

nthreads = omp_get_num_threads();

printf("Number of threads = %d\n", nthreads);

}

/*** Use barriers for clean output ***/

#pragma omp barrier

printf("Thread %d starting...\n",tid);

#pragma omp barrier

#pragma omp sections nowait

{

#pragma omp section

{

section = 1;

for (i=0; i<N; i++)

c[i] = a[i] * b[i];

print_results(c, tid, section);

}

#pragma omp section

{

section = 2;

for (i=0; i<N; i++)

c[i] = a[i] + b[i];

print_results(c, tid, section);

}

} /* end of sections */

/*** Use barrier for clean output ***/

#pragma omp barrier

printf("Thread %d exiting...\n",tid);

} /* end of parallel section */

}

void print_results(float array[N], int tid, int section) {

int i,j;

j = 1;


/*** use critical for clean output ***/

#pragma omp critical

{

printf("\nThread %d did section %d. The results are:\n", tid, section);

for (i=0; i<N; i++) {

printf("%e ",array[i]);

j++;

if (j == 6) {

printf("\n");

j = 1;

}

}

printf("\n");

} /*** end of critical ***/

#pragma omp barrier

printf("Thread %d done and synchronized.\n", tid);

}

Solving the problem: the run-time error is caused by the OMP BARRIER directive in the PRINT_RESULTS subroutine. By definition, an OMP BARRIER cannot be nested outside the static extent of a SECTIONS directive; in this case it is orphaned outside the calling SECTIONS block. If you delete the line with "#pragma omp barrier" from the print_results function, the program will no longer hang.

/******************************************************************************

* FILE: omp_bug4.c

* DESCRIPTION:

* This very simple program causes a segmentation fault.

* AUTHOR: Blaise Barney 01/09/04

* LAST REVISED: 04/06/05

******************************************************************************/

#include <omp.h>

#include <stdio.h>

#include <stdlib.h>

#define N 1048

int main (int argc, char *argv[])

{

int nthreads, tid, i, j;

double a[N][N];

/* Fork a team of threads with explicit variable scoping */

#pragma omp parallel shared(nthreads) private(i,j,tid,a)

{

/* Obtain/print thread info */

tid = omp_get_thread_num();

if (tid == 0)

{

nthreads = omp_get_num_threads();

printf("Number of threads = %d\n", nthreads);


}

printf("Thread %d starting...\n", tid);

/* Each thread works on its own private copy of the array */

for (i=0; i<N; i++)

for (j=0; j<N; j++)

a[i][j] = tid + i + j;

/* For confirmation */

printf("Thread %d done. Last element= %f\n",tid,a[N-1][N-1]);

} /* All threads join master thread and disband */

}

If you run the program, you will see that it causes a segmentation fault. The OpenMP thread stack size is an implementation-dependent resource; in this case the array is too large to fit into the thread stack space and causes the segmentation fault. On Linux you have to set the environment variable KMP_STACKSIZE to a larger value, e.g. 20000000.
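As an alternative to enlarging the thread stack (this is not the guide's fix, just a sketch of another common way out), the large per-thread array can be taken off the stack entirely by allocating it on the heap inside the parallel region; the names and sizes below mirror the omp_bug4.c listing above.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1048

int main (int argc, char *argv[])
{
    int i, j, tid;

    #pragma omp parallel private(i, j, tid)
    {
        tid = omp_get_thread_num();

        /* Heap allocation: each thread still owns its own copy,
           but it no longer has to fit on the thread stack */
        double *a = malloc((size_t)N * N * sizeof(double));

        if (a != NULL) {
            for (i = 0; i < N; i++)
                for (j = 0; j < N; j++)
                    a[i * N + j] = tid + i + j;

            printf("Thread %d done. Last element= %f\n", tid, a[N * N - 1]);
            free(a);
        }
    } /* end of parallel region */
    return 0;
}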

/******************************************************************************

* FILE: omp_bug5.c

* DESCRIPTION:

* Using SECTIONS, two threads initialize their own array and then add

* it to the other’s array, however a deadlock occurs.

* AUTHOR: Blaise Barney 01/29/04

* LAST REVISED: 04/06/05

******************************************************************************/

#include <omp.h>

#include <stdio.h>

#include <stdlib.h>

#define N 1000000

#define PI 3.1415926535

#define DELTA .01415926535

int main (int argc, char *argv[]) {

int nthreads, tid, i;

float a[N], b[N];

omp_lock_t locka, lockb;

/* Initialize the locks */

omp_init_lock(&locka);

omp_init_lock(&lockb);

/* Fork a team of threads giving them their own copies of variables */

#pragma omp parallel shared(a, b, nthreads, locka, lockb) private(tid)

{

/* Obtain thread number and number of threads */

tid = omp_get_thread_num();

#pragma omp master

{

nthreads = omp_get_num_threads();

printf("Number of threads = %d\n", nthreads);

}

printf("Thread %d starting...\n", tid);

#pragma omp barrier


#pragma omp sections nowait

{

#pragma omp section

{

printf("Thread %d initializing a[]\n",tid);

omp_set_lock(&locka);

for (i=0; i<N; i++)

a[i] = i * DELTA;

omp_set_lock(&lockb);

printf("Thread %d adding a[] to b[]\n",tid);

for (i=0; i<N; i++)

b[i] += a[i];

omp_unset_lock(&lockb);

omp_unset_lock(&locka);

}

#pragma omp section

{

printf("Thread %d initializing b[]\n",tid);

omp_set_lock(&lockb);

for (i=0; i<N; i++)

b[i] = i * PI;

omp_set_lock(&locka);

printf("Thread %d adding b[] to a[]\n",tid);

for (i=0; i<N; i++)

a[i] += b[i];

omp_unset_lock(&locka);

omp_unset_lock(&lockb);

}

} /* end of sections */

} /* end of parallel region */

}

EXPLANATION: The problem in omp_bug5 is that the first thread acquires locka and then tries to get lockb before releasing locka. Meanwhile, the second thread has acquired lockb and then tries to get locka before releasing lockb. The solution below overcomes the deadlock by using the locks correctly.

/******************************************************************************

* FILE: omp_bug5fix.c

* AUTHOR: Blaise Barney 01/29/04

* LAST REVISED: 04/06/05

******************************************************************************/

#include <omp.h>

#include <stdio.h>

#include <stdlib.h>

#define N 1000000

#define PI 3.1415926535

#define DELTA .01415926535

int main (int argc, char *argv[])


{

int nthreads, tid, i;

float a[N], b[N];

omp_lock_t locka, lockb;

/* Initialize the locks */

omp_init_lock(&locka);

omp_init_lock(&lockb);

/* Fork a team of threads giving them their own copies of variables */

#pragma omp parallel shared(a, b, nthreads, locka, lockb) private(tid)

{

/* Obtain thread number and number of threads */

tid = omp_get_thread_num();

#pragma omp master

{

nthreads = omp_get_num_threads();

printf("Number of threads = %d\n", nthreads);

}

printf("Thread %d starting...\n", tid);

#pragma omp barrier

#pragma omp sections nowait

{

#pragma omp section

{

printf("Thread %d initializing a[]\n",tid);

omp_set_lock(&locka);

for (i=0; i<N; i++)

a[i] = i * DELTA;

omp_unset_lock(&locka);

omp_set_lock(&lockb);

printf("Thread %d adding a[] to b[]\n",tid);

for (i=0; i<N; i++)

b[i] += a[i];

omp_unset_lock(&lockb);

}

#pragma omp section

{

printf("Thread %d initializing b[]\n",tid);

omp_set_lock(&lockb);

for (i=0; i<N; i++)

b[i] = i * PI;

omp_unset_lock(&lockb);

omp_set_lock(&locka);

printf("Thread %d adding b[] to a[]\n",tid);

for (i=0; i<N; i++)

a[i] += b[i];

omp_unset_lock(&locka);

}

} /* end of sections */

} /* end of parallel region */


}

6.3.9 OpenMP Debugging - FORTRAN

C******************************************************************************

C FILE: omp_bug1.f

C DESCRIPTION:

C This example attempts to show use of the PARALLEL DO construct. However

C it will generate errors at compile time. Try to determine what is causing

C the error. See omp_bug1fix.f for a corrected version.

C AUTHOR: Blaise Barney 5/99

C LAST REVISED:

C******************************************************************************

PROGRAM WORKSHARE3

INTEGER TID, OMP_GET_THREAD_NUM, N, I, CHUNKSIZE, CHUNK

PARAMETER (N=50)

PARAMETER (CHUNKSIZE=5)

REAL A(N), B(N), C(N)

! Some initializations

DO I = 1, N

A(I) = I * 1.0

B(I) = A(I)

ENDDO

CHUNK = CHUNKSIZE

!$OMP PARALLEL DO SHARED(A,B,C,CHUNK)

!$OMP& PRIVATE(I,TID)

!$OMP& SCHEDULE(STATIC,CHUNK)

TID = OMP_GET_THREAD_NUM()

DO I = 1, N

C(I) = A(I) + B(I)

PRINT *,'TID= ',TID,'I= ',I,'C(I)= ',C(I)

ENDDO

!$OMP END PARALLEL DO

END

EXPLANATION: This example illustrates the use of the combined PARALLEL DO directive. It fails because the loop does not come immediately after the directive. The corrections include removing all statements between the PARALLEL DO directive and the actual loop, and adding logic to preserve the ability to query the thread id and print it from inside the loop. Notice the use of the FIRSTPRIVATE clause to initialise the flag.

C******************************************************************************

C FILE: omp_bug1fix.f

C DESCRIPTION:

C This is a corrected version of the omp_bug1.f example. Corrections

C include removing all statements between the PARALLEL DO construct and

C the actual DO loop, and introducing logic to preserve the ability to

C query a thread’s id and print it from inside the DO loop.

C AUTHOR: Blaise Barney 5/99


C LAST REVISED:

C******************************************************************************

PROGRAM WORKSHARE4

INTEGER TID, OMP_GET_THREAD_NUM, N, I, CHUNKSIZE, CHUNK

PARAMETER (N=50)

PARAMETER (CHUNKSIZE=5)

REAL A(N), B(N), C(N)

CHARACTER FIRST_TIME

! Some initializations

DO I = 1, N

A(I) = I * 1.0

B(I) = A(I)

ENDDO

CHUNK = CHUNKSIZE

FIRST_TIME = 'Y'

!$OMP PARALLEL DO SHARED(A,B,C,CHUNK)

!$OMP& PRIVATE(I,TID)

!$OMP& SCHEDULE(STATIC,CHUNK)

!$OMP& FIRSTPRIVATE(FIRST_TIME)

DO I = 1, N

IF (FIRST_TIME .EQ. 'Y') THEN

TID = OMP_GET_THREAD_NUM()

FIRST_TIME = 'N'

ENDIF

C(I) = A(I) + B(I)

PRINT *,'TID= ',TID,'I= ',I,'C(I)= ',C(I)

ENDDO

!$OMP END PARALLEL DO

END

C******************************************************************************

C FILE: omp_bug5.f

C DESCRIPTION:

C Using SECTIONS, two threads initialize their own array and then add

C it to the other’s array, however a deadlock occurs.

C AUTHOR: Blaise Barney 01/09/04

C LAST REVISED:

C******************************************************************************

PROGRAM BUG5

INTEGER*8 LOCKA, LOCKB

INTEGER NTHREADS, TID, I,

+ OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM

PARAMETER (N=1000000)

REAL A(N), B(N), PI, DELTA

PARAMETER (PI=3.1415926535)

PARAMETER (DELTA=.01415926535)


C Initialize the locks

CALL OMP_INIT_LOCK(LOCKA)

CALL OMP_INIT_LOCK(LOCKB)

C Fork a team of threads giving them their own copies of variables

!$OMP PARALLEL SHARED(A, B, NTHREADS, LOCKA, LOCKB) PRIVATE(TID)

C Obtain thread number and number of threads

TID = OMP_GET_THREAD_NUM()

!$OMP MASTER

NTHREADS = OMP_GET_NUM_THREADS()

PRINT *, 'Number of threads = ', NTHREADS

!$OMP END MASTER

PRINT *, 'Thread', TID, 'starting...'

!$OMP BARRIER

!$OMP SECTIONS

!$OMP SECTION

PRINT *, 'Thread',TID,' initializing A()'

CALL OMP_SET_LOCK(LOCKA)

DO I = 1, N

A(I) = I * DELTA

ENDDO

CALL OMP_SET_LOCK(LOCKB)

PRINT *, 'Thread',TID,' adding A() to B()'

DO I = 1, N

B(I) = B(I) + A(I)

ENDDO

CALL OMP_UNSET_LOCK(LOCKB)

CALL OMP_UNSET_LOCK(LOCKA)

!$OMP SECTION

PRINT *, 'Thread',TID,' initializing B()'

CALL OMP_SET_LOCK(LOCKB)

DO I = 1, N

B(I) = I * PI

ENDDO

CALL OMP_SET_LOCK(LOCKA)

PRINT *, 'Thread',TID,' adding B() to A()'

DO I = 1, N

A(I) = A(I) + B(I)

ENDDO

CALL OMP_UNSET_LOCK(LOCKA)

CALL OMP_UNSET_LOCK(LOCKB)

!$OMP END SECTIONS NOWAIT

PRINT *, 'Thread',TID,' done.'

!$OMP END PARALLEL

END

C******************************************************************************

C FILE: omp_bug5fix.f

C DESCRIPTION:

C The problem in omp_bug5.f is that the first thread acquires locka and then

C tries to get lockb before releasing locka. Meanwhile, the second thread


C has acquired lockb and then tries to get locka before releasing lockb.

C This solution overcomes the deadlock by using locks correctly.

C AUTHOR: Blaise Barney 01/09/04

C LAST REVISED:

C******************************************************************************

PROGRAM BUG5

INTEGER*8 LOCKA, LOCKB

INTEGER NTHREADS, TID, I, OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM

PARAMETER (N=1000000)

REAL A(N), B(N), PI, DELTA

PARAMETER (PI=3.1415926535)

PARAMETER (DELTA=.01415926535)

C Initialize the locks

CALL OMP_INIT_LOCK(LOCKA)

CALL OMP_INIT_LOCK(LOCKB)

C Fork a team of threads giving them their own copies of variables

!$OMP PARALLEL SHARED(A, B, NTHREADS, LOCKA, LOCKB) PRIVATE(TID)

C Obtain thread number and number of threads

TID = OMP_GET_THREAD_NUM()

!$OMP MASTER

NTHREADS = OMP_GET_NUM_THREADS()

PRINT *, 'Number of threads = ', NTHREADS

!$OMP END MASTER

PRINT *, 'Thread', TID, 'starting...'

!$OMP BARRIER

!$OMP SECTIONS

!$OMP SECTION

PRINT *, 'Thread',TID,' initializing A()'

CALL OMP_SET_LOCK(LOCKA)

DO I = 1, N

A(I) = I * DELTA

ENDDO

CALL OMP_UNSET_LOCK(LOCKA)

CALL OMP_SET_LOCK(LOCKB)

PRINT *, 'Thread',TID,' adding A() to B()'

DO I = 1, N

B(I) = B(I) + A(I)

ENDDO

CALL OMP_UNSET_LOCK(LOCKB)

!$OMP SECTION

PRINT *, 'Thread',TID,' initializing B()'

CALL OMP_SET_LOCK(LOCKB)

DO I = 1, N

B(I) = I * PI

ENDDO

CALL OMP_UNSET_LOCK(LOCKB)

CALL OMP_SET_LOCK(LOCKA)


PRINT *, 'Thread',TID,' adding B() to A()'

DO I = 1, N

A(I) = A(I) + B(I)

ENDDO

CALL OMP_UNSET_LOCK(LOCKA)

!$OMP END SECTIONS NOWAIT

PRINT *, 'Thread',TID,' done.'

!$OMP END PARALLEL

END

6.4 Debuggers

6.4.1 Sun Studio Integrated Debugger

Sun Studio includes a debugger for serial and multithreaded programs. You will find more information on how to use this environment online. You may start debugging your program from Run - Debug executable. A series of actions are available from the Run menu, but they may also be specified from the command line when starting Sun Studio. To start a debugging session on a program that is already running, attach to it with:

$ sunstudio -A pid[:program_name]

or from Run - Attach Debugger. To analyse a core dump, use:

$ sunstudio -C core[:program_name]

or Run - Debug core file

6.4.2 TotalView

TotalView is a sophisticated software debugger from TotalView Technologies that has been selected as the Department of Energy's Advanced Simulation and Computing program's debugger. It is used for debugging and analyzing both serial and parallel programs and is especially designed for complex, multi-process and/or multi-threaded applications. It handles most types of HPC parallel coding and is supported on most HPC platforms. It provides both a GUI and a command line interface and can be used to debug programs, running processes, and core files; it also offers memory debugging features, graphical visualization of array data, and a comprehensive built-in help system.

Supported Platforms and Languages
Supported languages include the usual HPC application languages:
o C/C++
o Fortran77/90
o Mixed C/C++ and Fortran
o Assembler

Compiling Your Program
-g : Like many UNIX debuggers, you will need to compile your program with the appropriate flag to enable generation of symbolic debug information. For most compilers, the -g option is used for this. TotalView will allow you to debug executables which were not compiled with the -g option; however, only the assembler code can be viewed.

Beyond -g :


Don't compile your program with optimization flags while you are debugging it. Compiler optimizations can "rewrite" your program and produce machine code that doesn't necessarily match your source code. Parallel programs may require additional compiler flags.

Overview
TotalView is a full-featured, source-level, graphical debugger for C, C++, and Fortran (77 and 90), assembler, and mixed source/assembler codes, based on the X Window System, from Etnus. TotalView supports MPI, PVM and HPF. Information on TotalView is available in the release notes and user guide at the Etnus Online Documentation page. Also see "man totalview" for command syntax and options. Note: In order to use TotalView, you must be using a terminal or workstation capable of displaying X Windows. See Using the X Window System for more information.

TotalView on Linux Clusters
TotalView is available on NCSA's Linux Clusters. On Abe there is a 384-token TotalView license and you check out only the number of licenses you need. We do not currently have a way to guarantee you will get a license when your job starts if you run in batch. GNU and Intel compilers are both supported.

Important: For both compilers you need to compile and link your code with -g to enable source code listing within TotalView. TotalView is also supported on Abe.

Starting TotalView on the cluster:
To use TotalView on the cluster there are a few steps to follow. It is very important to have the $HOME directory shared over the network between nodes. The following 11 steps show you how to run TotalView to debug your program:

1. Download an NX client on your local station (more details about how to configure it can be found in the Environment chapter, File Management subsection of the present Clusterguide);

2. Connect with the NX client to the cluster;

3. Write or upload your program in the home folder;

4. Compile it using the proper compiler and add -g to the compile flags. This produces debugging information in the system's native format.

gcc -fopenmp -O3 -g -o app_lab4_gcc openmp_stack_quicksort.c

5. Find the value for $DISPLAY using the command:

$ echo $DISPLAY

You also need to find out the port the X11 connection is forwarded on. For example, if DISPLAY is host:14.0 the connection port will be 6014.

setenv DISPLAY fep.grid.pub.ro:14.0

6. Type the command xhost +, in order to disable the access control

$ xhost +

Now access control is disabled and clients can connect from any host (X11 connections are allowed).

7. Currently, the only available queue is ibm-quad.q and the version of TotalView is totalview-8.6.2-2. To find out what queues are available, use the command:

qconf -sql

8. Assuming the ibm-quad.q queue is available, run the following command:


qsub -q ibm-quad.q -cwd

9. After running the command above you have to write the following lines (:1000.0 can be replaced by the value that you obtained after typing the command echo $DISPLAY):

module load debuggers/totalview-8.4.1-7

setenv DISPLAY fep.grid.pub.ro:1000.0

totalview

10. After that, press Ctrl+D in order to submit your request to the queue. When the xterm window appears, we will launch TotalView. If the window does not appear, check the job output; it should give you some clue as to why your script failed. Maybe you misspelled a command, or maybe the port used for X11 forwarding is closed by the firewall. You should check these two things first.

11. Open /opt/totalview/toolworks/totalview.8.6.2-2/bin and run the totalview executable.

Now you should be able to see the graphical interface of TotalView. Here are some pictures to help you interact with it. Next you have to select the executable file corresponding to your program, which has been previously compiled using the "-g" option, from the appropriate folder on the cluster (the complete path to your home folder).


Next you need to select the parallel environment you want to use.

The next step is to set the number of tasks you want to run:


Here is an example of how to debug your source. Try to use the facilities and options offered by TotalView, combined with the examples shown in the tutorials below.

Some helpful links for working with TotalView:
o TotalView Tutorial
o TotalView Exercise

TotalView Command Line Interpreter


The TotalView Command Line Interpreter (CLI) provides a command line debugger interface. It can be launched either stand-alone or via the TotalView GUI debugger.

The CLI consists of two primary components:
o The CLI commands
o A Tcl interpreter

Because the CLI includes a Tcl interpreter, CLI commands can be integrated into user-written Tcl programs/scripts for "automated" debugging. Of course, putting the CLI to real use in this manner will require some expertise in Tcl.

Most often, the TotalView GUI is the method of choice for debugging. However, the CLI may be the method of choice in those circumstances where using the GUI is impractical:

o When a program takes several days to execute.
o When the program must be run under a batch scheduling system or network conditions that inhibit GUI interaction.
o When network traffic between the executing program and the person debugging is not permitted or limits the use of the GUI.

See the TotalView documentation located at the TotalView Official Site for details:
o TotalView User Guide - relevant chapters
o TotalView Reference Guide - complete coverage of all CLI commands, variables and usage.

Starting an Interactive CLI Debug Session:
Method 1: From within the TotalView GUI:
1. Use either path:

Process Window > Tools Menu > Command Line

Root Window > Tools Menu > Command Line

2. A TotalView CLI xterm window (below) will then open for you to enter CLI commands.
3. Load/Start your executable or attach to a running process.
4. Issue CLI commands.

Method 2: From a shell prompt window:
1. Invoke the totalviewcli command (provided that it is in your path).
2. Load/Start your executable or attach to a running process.
3. Issue CLI commands.

CLI Commands:
As of TotalView version 8, there are approximately 75 CLI commands. These are covered completely in the TotalView Reference Guide. Some representative CLI commands are shown in the following tables.


Environment Commands
alias      creates or views user-defined commands
capture    allows commands that print information to send their output to a string variable
dgroups    manipulates and manages groups
dset       changes or views values of CLI state variables
dunset     restores default settings of CLI state variables
help       displays help information
stty       sets terminal properties
unalias    removes a previously defined command
dworker    adds or removes a thread from a workers group

CLI Initialization and Termination
dattach    attaches to one or more processes executing in the normal run-time environment
ddetach    detaches from processes
dkill      kills existing user process, leaving debugging information in place
dload      loads debugging information about the target program and prepares it for execution
dreload    reloads the current executable
drerun     restarts a process
drun       starts or restarts the execution of user processes under control of the CLI
dstatus    shows current status of processes and threads
quit       exits from the CLI, ending the debugging session

Program Information
dassign    changes the value of a scalar variable
dlist      browses source code relative to a particular file, procedure or line
dmstat     displays memory use information
dprint     evaluates an expression or program variable and displays the resulting value
dptsets    shows status of processes and threads
dwhat      determines what a name refers to
dwhere     prints information about the target thread's stack

Execution Control
dcont      continues execution of processes and waits for them
dfocus     changes the set of processes, threads, or groups upon which a CLI command acts
dgo        resumes execution of processes (without blocking)
dhalt      suspends execution of processes
dhold      holds threads or processes
dnext      executes statements, stepping over subfunctions
dnexti     executes machine instructions, stepping over subfunctions
dout       runs out from the current subroutine
dstep      executes statements, moving into subfunctions if required
dstepi     executes machine instructions, moving into subfunctions if required
dunhold    releases a held process or thread
duntil     runs the process until a target place is reached
dwait      blocks command input until processes stop


Action Points
dactions   views information on action point definitions and their current status
dbarrier   defines a process or thread barrier breakpoint
dbreak     defines a breakpoint
ddelete    deletes an action point
ddisable   temporarily disables an action point
denable    re-enables an action point that has been disabled
dwatch     defines a watchpoint

Miscellaneous
dcache     clears the remote library cache
ddown      moves down the call stack
dflush     unwinds stack from suspended computations
dlappend   appends list elements to a TotalView variable
dup        moves up the call stack


7 Parallelization

Parallelization for computers with shared memory (SM) means the automatic distribution of loop iterations over several processors (autoparallelization), the explicit distribution of work over the processors by compiler directives (OpenMP) or function calls to threading libraries, or a combination of those.

Parallelization for computers with distributed memory (DM) is done via the explicit distribution of work and data over the processors and their coordination through the exchange of messages (Message Passing with MPI). MPI programs run on shared memory computers as well, whereas OpenMP programs usually do not run on computers with distributed memory.

There are solutions that try to achieve the programming ease of shared memory parallelization on distributed memory clusters. For example, Intel's Cluster OpenMP offers a relatively easy way to get OpenMP programs running on a cluster.

For large applications the hybrid parallelization approach, a combination of coarse-grained parallelism with MPI and underlying fine-grained parallelism with OpenMP, might be attractive, in order to use as many processors as efficiently as possible.

Please note that large computing jobs should not be started interactively, and that when submitting batch jobs, the GridEngine batch system determines the distribution of the MPI tasks on the machines to a large extent.

7.1 Shared Memory Programming

For shared memory programming, OpenMP (http://www.openmp.org) is the de facto standard. The OpenMP API is defined for Fortran, C and C++ and consists of compiler directives, runtime routines and environment variables.

In the parallel regions of a program several threads are started. They execute the contained program segment redundantly until they hit a worksharing construct. Within this construct, the contained work (usually do- or for-loops) is distributed among the threads. Under normal conditions all threads have access to all data (shared data). But pay attention: if data accessed by several threads is modified, then the access to this data must be protected with critical regions or OpenMP locks. Private data areas can also be used, where the individual threads hold their local data. Such private data (in OpenMP terminology) is only visible to the thread owning it; other threads will not be able to read or write it.
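As a minimal illustration of these ideas (a sketch written for this section, not one of the cluster's PRIMER examples), the following C fragment opens a parallel region, distributes a loop over the threads with a worksharing construct, keeps the loop index private, and protects the update of a shared counter with a critical region. It can be built, for example, with gcc -fopenmp.

#include <omp.h>
#include <stdio.h>

#define N 1000

int main (void)
{
    int i, updates = 0;          /* shared by all threads */
    double a[N];

    #pragma omp parallel shared(a, updates) private(i)
    {
        /* Worksharing construct: the iterations are split among the threads */
        #pragma omp for
        for (i = 0; i < N; i++)
            a[i] = i * 0.5;

        /* Modifying shared data must be protected */
        #pragma omp critical
        updates = updates + 1;
    }

    printf("Threads that updated the shared counter: %d\n", updates);
    return 0;
}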

Note: In many cases the stack area for the slave threads must be increased by changing a compiler-specific environment variable (e.g. Sun Studio: STACKSIZE, Intel: KMP_STACKSIZE), and the stack area for the master thread must be increased with the command ulimit -s xxx (zsh shell, specified in KB) or limit s xxx (C-shell, in KB).

Hint: In a loop which is to be parallelized, the results must not depend on the order of the loop iterations! Try to run the loop backwards in serial mode; the results should be the same. This is a necessary, but not a sufficient, condition! The number of threads has to be specified by the environment variable OMP_NUM_THREADS.
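For example (an illustrative sketch, not taken from the guide's sources), the first loop below carries a dependence between iterations and gives different results when run backwards, so it must not be parallelized as it stands; the second loop is iteration-order independent and is a safe candidate:

#include <stdio.h>

#define M 8

int main (void)
{
    double a[M] = {1, 1, 1, 1, 1, 1, 1, 1};
    double b[M] = {2, 2, 2, 2, 2, 2, 2, 2};
    double c[M];
    int i;

    /* NOT safe to parallelize: iteration i reads the result of iteration i-1 */
    for (i = 1; i < M; i++)
        a[i] = a[i-1] + b[i];

    /* Safe candidate: every iteration touches only its own elements */
    for (i = 0; i < M; i++)
        c[i] = a[i] + b[i];

    printf("a[M-1] = %f, c[M-1] = %f\n", a[M-1], c[M-1]);
    return 0;
}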

Note: The OpenMP standard does not specify a value for OMP_NUM_THREADS in case it is not set explicitly. If OMP_NUM_THREADS is not set, Sun OpenMP for example starts only 1 thread, as opposed to the Intel compiler, which starts as many threads as there are processors available.

On a loaded system fewer threads may be employed than specified by this environment variable, because the dynamic mode may be used by default. Use the environment variable OMP_DYNAMIC to change this behavior. If you want to use nested OpenMP, the environment variable OMP_NESTED=true has to be set.


The OpenMP compiler options are summarized below. These compiler flags are set in the environment variables FLAGS_AUTOPAR and FLAGS_OPENMP (as explained in section 6.1.1).

Compiler   OpenMP flag                Autoparallelization flag
Sun        -xopenmp                   -xautopar -xreduction
Intel      -openmp                    -parallel
GNU        -fopenmp (4.2 and above)   n.a. (planned for 4.3)
PGI        -mp                        -Mconcur -Minline

An example program using OpenMP is /export/home/stud/username/PRIMER/PROFILE/openmp_only.c.

7.1.1 Automatic Shared Memory Parallelization of Loops

The Sun Fortran, C and C++ compilers are able to parallelize loops automatically. Success or failure depends on the compiler's ability to prove it is safe to parallelize a (nested) loop. This is often application-area specific (e.g. finite differences versus finite elements), language dependent (pointers and function calls may make the analysis difficult) and coding-style dependent. The appropriate option is -xautopar, which includes -depend -xO3. Although the -xparallel option is also available, we do not recommend using it: it combines automatic and explicit parallelization, but assumes the older Sun parallel programming model is used instead of OpenMP. In case you would like to combine automatic parallelization and OpenMP, we strongly suggest using the -xautopar -xopenmp combination. With the option -xreduction, automatic parallelization of reductions is also permitted, e.g. accumulations, dot products etc., whereby the modification of the sequence of the arithmetic operations can cause different rounding-error accumulations. Compiling with the option -xloopinfo supplies information about the parallelization; the compiler messages are shown on the screen. If the number of loop iterations is unknown at compile time, code is produced which decides at run time whether a parallel execution of the loop is more efficient or not (alternate coding). Also with automatic parallelization, the number of threads used can be specified by the environment variable OMP_NUM_THREADS.
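A loop of the kind the -xreduction option targets is a simple dot product (an illustrative sketch; the variable names are not from the guide's examples). Compiled with -xautopar -xreduction -xloopinfo, the accumulation into sum may be parallelized automatically, at the price of a possibly different rounding-error accumulation:

#include <stdio.h>

#define N 100000

int main (void)
{
    static double x[N], y[N];
    double sum = 0.0;
    int i;

    for (i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    /* Reduction loop: an accumulation that autoparallelization can handle */
    for (i = 0; i < N; i++)
        sum = sum + x[i] * y[i];

    printf("dot product = %f\n", sum);
    return 0;
}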

7.1.2 GNU Compilers

With version 4.2 the GNU compiler collection supports OpenMP with the option -fopenmp. It supports nesting using the standard OpenMP environment variables. Using the variable GOMP_STACKSIZE one can also set the default thread stack size (in kilobytes).

CPU binding can be done with the GOMP_CPU_AFFINITY environment variable. The variable should contain a space- or comma-separated list of CPUs. This list may contain different kinds of entries: single CPU numbers in any order, a range of CPUs (M-N), or a range with a stride (M-N:S). CPU numbers are zero based. For example, GOMP_CPU_AFFINITY="0 3 1-2 4-15:2" will bind the initial thread to CPU 0, the second to CPU 3, the third to CPU 1, the fourth to CPU 2, the fifth to CPU 4, the sixth through tenth to CPUs 6, 8, 10, 12 and 14 respectively, and then start assigning again from the beginning of the list. GOMP_CPU_AFFINITY=0 binds all threads to CPU 0.

Automatic Shared Memory Parallelization of Loops
Since version 4.3, GNU compilers are able to parallelize loops automatically using the -ftree-parallelize-loops=[threads] option. However, the number of threads to use has to be specified at compile time and is thus fixed at runtime.


7.1.3 Intel Compilers

By adding the option -openmp, the OpenMP directives are interpreted by the Intel compilers. Nested OpenMP is supported too.

The slave threads' stack size can be increased with the environment variable KMP_STACKSIZE=<megabytes>M.

Attention: By default the number of threads is set to the number of processors. It is not recommended to set this variable to larger values than the number of processors available on the current machine. By default, the environment variables OMP_DYNAMIC and OMP_NESTED are set to false. Intel compilers provide an easy way for processor binding: just set the environment variable KMP_AFFINITY to compact or scatter, e.g.

$ export KMP_AFFINITY=scatter

Compact binds the threads as near as possible, e.g. two threads on different cores of one processor chip. Scatter binds the threads as far away as possible, e.g. two threads, each on one core of different processor sockets.

Automatic Shared Memory Parallelization of Loops
The Intel Fortran, C and C++ compilers are able to parallelize certain loops automatically. This feature can be turned on with the option -parallel. The number of threads used is specified by the environment variable OMP_NUM_THREADS.

Note: using the option -O2 enables automatic inlining, which may help the automatic parallelization if functions are called within a loop.

7.1.4 PGI Compilers

By adding the option -mp, the OpenMP directives, according to the OpenMP version 1 specifications, are interpreted by the PGI compilers.

Explicit parallelization can be combined with the automatic parallelization of the compiler. Loops within parallel OpenMP regions are no longer subject to automatic parallelization. Nested parallelization is not supported. The slave threads' stack size can be increased with the environment variable MPSTKZ=<megabytes>M.

By default OMP_NUM_THREADS is set to 1. It is not recommended to set this variable to a larger value than the number of processors available on the current machine. The environment variables OMP_DYNAMIC and OMP_NESTED have no effect!

The PGI compiler offers some support for NUMA architectures like the V40z Opteron systems with the option -mp=numa. Using NUMA can improve the performance of some parallel applications by reducing memory latency. Linking with -mp=numa also allows you to use the environment variables MP_BIND, MP_BLIST and MP_SPIN. When MP_BIND is set to yes, parallel processes or threads are bound to a physical processor. This ensures that the operating system will not move your process to a different CPU while it is running. Using MP_BLIST, you can specify exactly which processors to attach your process to. For example, if you have a quad-socket dual-core system (8 CPUs), you can set the blist so that the processes are interleaved across the 4 sockets (MP_BLIST=2,4,6,0,1,3,5,7) or bound to particular CPUs (MP_BLIST=6,7).

Threads at a barrier in a parallel region check a semaphore to determine if they can proceed. If the semaphore is not free after a certain number of tries, the thread gives up the processor for a while before checking again. The MP_SPIN variable defines the number of times a thread checks a semaphore before idling. Setting MP_SPIN to -1 tells the thread never to idle. This can improve performance, but it can waste CPU cycles that could be used by a different process if the thread spends a significant amount of time at a barrier.


Automatic Shared Memory Parallelization of Loops
The PGI Fortran, C and C++ compilers are able to parallelize certain loops automatically. This feature can be turned on with the option -Mconcur. The number of threads used is also specified by the environment variable OMP_NUM_THREADS.

Note: Using the option -Minline, the compiler tries to inline functions, so even loops with function calls may be parallelized.

7.2 Message Passing with MPI

MPI (Message-Passing Interface) is the de facto standard for parallelization on distributed memory parallel systems. Multiple processes explicitly exchange data and coordinate their work flow. MPI specifies the interface but not the implementation, so there are plenty of implementations for PCs as well as for supercomputers. There are freely available implementations and commercial ones, which are particularly tuned for the target platform. MPI has a huge number of calls, although it is possible to write meaningful MPI applications employing just some 10 of these calls.

An example program using MPI is /export/home/stud/alascateu/PRIMER/PROFILE/mpi.c.
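For orientation, a minimal MPI program in C (a generic sketch, not the PRIMER example above) needs only a handful of those calls: initialize the library, ask for the rank and the size of the communicator, and shut the library down. Compile it with one of the MPI compiler drivers described in the next subsection (e.g. mpicc) and start it with mpirun.

#include <mpi.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                  /* start the MPI runtime           */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I?             */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes are running? */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}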

7.2.1 OpenMPI

The Open MPI Project (www.open-mpi.org) is an open source MPI-2 implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to combine the expertise, technologies, and resources from all across the High Performance Computing community in order to build the best MPI library available. Open MPI offers advantages for system and software vendors, application developers and computer science researchers.

The compiler drivers are mpicc for C, mpif77 and mpif90 for FORTRAN, and mpicxx and mpiCC for C++.

mpirun is used to start an MPI program. Refer to the manual page for a detailed description of mpirun ($ man mpirun).

We have several Open MPI implementations. To use the one suitable for your programs, you must load the appropriate module (remember to also load the corresponding compiler module). For example, if you want to use the PGI implementation you should type the following:

$ module list

Currently Loaded Modulefiles:

1) batch-system/sge-6.2u3 4) switcher/1.0.13

2) compilers/sunstudio12.1 5) oscar-modules/1.0.5

3) mpi/openmpi-1.3.2_sunstudio12.1

$ module avail

[...]

-------------------- /opt/modules/modulefiles --------------------

apps/hrm debuggers/totalview-8.6.2-2

apps/matlab grid/gLite-UI-3.1.31-Prod

apps/uso09 java/jdk1.6.0_13-32bit

batch-system/sge-6.2u3 java/jdk1.6.0_13-64bit

cell/cell-sdk-3.1 mpi/openmpi-1.3.2_gcc-4.1.2

compilers/gcc-4.1.2 mpi/openmpi-1.3.2_gcc-4.4.0


compilers/gcc-4.4.0 mpi/openmpi-1.3.2_intel-11.0_081

compilers/intel-11.0_081 mpi/openmpi-1.3.2_pgi-7.0.7

compilers/pgi-7.0.7 mpi/openmpi-1.3.2_sunstudio12.1

compilers/sunstudio12.1 oscar-modules/1.0.5(default)

Load the PGI implementation of MPI:

$ module switch mpi/openmpi-1.3.2_pgi-7.0.7

Load the PGI compiler:

$ module switch compilers/pgi-7.0.7

Now if you type mpicc you'll see that the wrapper calls pgcc.

7.2.2 Intel MPI Implementation

Intel MPI is a commercial implementation based on mpich2, which is a public domain implementation of the MPI-2 standard provided by the Mathematics and Computer Science Division of the Argonne National Laboratory.

The compiler drivers mpifc, mpiifort, mpiicc, mpiicpc, mpicc and mpicxx and the instruction for starting an MPI application, mpiexec, will be included in the search path.

There are two different kinds of compiler drivers: mpiifort, mpiicc and mpiicpc are the compiler drivers for the Intel compilers; mpifc, mpicc and mpicxx are the compiler drivers for GCC (GNU Compiler Collection).

To use the Intel implementation you must load the appropriate modules, just like in the PGI example in the OpenMPI section.

Examples:

$ mpiifort -c ... *.f90

$ mpiicc -o a.out *.o

$ mpirun -np 4 a.out

$ ifort -I$MPI_INCLUDE -c prog.f90

$ mpirun -np 4 a.out

7.3 Hybrid Parallelization

The combination of MPI and OpenMP and/or autoparallelization is called hybrid parallelization. Each MPI process may be multi-threaded. In order to use hybrid parallelization the MPI library has to support it. There are 4 levels of possible support:

1. single - multi-threading is not supported.

2. funneled - only the main thread, which initializes MPI, is allowed to make MPI calls.

3. serialized - only one thread may call the MPI library at a time.

4. multiple - multiple threads may call MPI, without restrictions.

You can use the MPI_Init_thread function to query the multi-threading support of the MPI implementation.
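A hedged sketch of such a query in C: the program requests MPI_THREAD_MULTIPLE and then inspects which level the library actually provides.

#include <mpi.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
    int provided;

    /* Ask for the highest level and let the library report what it grants */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_FUNNELED)
        printf("Multi-threading is not usable (level %d)\n", provided);
    else
        printf("Provided thread support level: %d\n", provided);

    MPI_Finalize();
    return 0;
}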

A quick example of a hybrid program is /export/home/stud/alascateu/PRIMER/PROFILE/hybrid.c.

It is a standard Laplace equation program with MPI support, into which a simple OpenMP matrix multiply routine was inserted. Thus, every process distributed over the cluster will spawn multiple threads that multiply some random matrices. The matrix dimensions were enlarged so that the program runs long enough to collect experiment data with the Sun Analyzer presented in the Performance / Runtime Analysis Tools section. To run the program (C environment in the example), compile it as an MPI program but with OpenMP support:

$ mpicc -fopenmp hybrid.c -o hybrid

Run it with (due to the Laplace layout you need 4 processors):

$ mpirun -np 4 hybrid
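A stripped-down hybrid sketch (not the Laplace program itself; the array size and names are illustrative) shows the overall structure: MPI distributes the processes over the cluster, and each process opens an OpenMP parallel loop for its own share of the work. It compiles the same way as above, with mpicc -fopenmp.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000

int main (int argc, char *argv[])
{
    int rank, i;
    double local[N], sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each MPI process spawns a team of threads for its own chunk of work */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++) {
        local[i] = rank + i * 0.001;
        sum += local[i];
    }

    printf("Rank %d computed partial sum %f\n", rank, sum);

    MPI_Finalize();
    return 0;
}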

7.3.1 Hybrid Parallelization with Intel-MPI

Unfortunately, Intel MPI is not thread safe by default. Calls to the MPI library should not be made inside parallel regions if the thread-safe library is not linked to the program. To provide full MPI support inside parallel regions the program must be linked with the option -mt_mpi.

Note: If you specify either the -openmp or the -parallel option of the Intel C Compiler, the thread-safe version of the library is used.

Note: If you specify one of the following options for the Intel Fortran Compiler, the thread-safe version of the library is used:

1. -openmp

2. -parallel

3. -threads

4. -reentrancy

5. -reentrancy threaded


8 Performance / Runtime Analysis Tools

This chapter describes tools that are available to help you assess the performance of your code, identify potential performance problems, and locate the part of the code where most of the execution time is spent. It also covers the installation and running of an Intel MPI benchmark.

8.1 Sun Sampling Collector and Performance Analyzer

The Sun Sampling Collector and the Performance Analyzer are a pair of tools that you can use to collect and analyze performance data for your serial or parallel application. The Collector gathers performance data by sampling at regular time intervals and by tracing function calls. The performance information is gathered in so-called experiment files, which can then be displayed with the analyzer GUI or the er_print command-line tool after the program has finished. Since the Collector is part of the Sun compiler suite, the studio compiler module has to be loaded. However, programs to be analyzed do not have to be compiled with the Sun compiler; the GNU or Intel compilers, for example, work as well.

8.1.1 Collecting experiment data

The first step in profiling with the Sun Analyzer is to obtain experiment data. For this you must compile your code with the -g option. After that you can either run collect like this:

$ collect a.out

or use the GUI.

To use the GUI to collect experiment data, start the analyzer (X11 forwarding must be enabled - $ analyzer), go to Collect Experiment under the File menu and select the Target, Working Directory and add Arguments if you need to. Click on Preview Command to view the command for collecting experiment data only. Now you can submit the command to a queue. Some examples of scripts used to submit the command follow (the path to collect might be different):

$ cat script.sh

#!/bin/bash

qsub -q [queue] -pe [pe] [np] -cwd -b y \

"/opt/sun/sunstudio12.1/prod/bin/collect -p high -M CT8.1 -S on -A on -L none \

mpirun -np 4 -- /path/to/file/test"

$ cat script_OMP_ONLY.sh

#!/bin/bash

qsub -q [queue] [pe] [np] -v OMP_NUM_THREADS=8 -cwd -b y \

"/opt/sun/sunstudio12.1/prod/bin/collect -L none \

-p high -S on -A on /path/to/file/test"

$ cat scriptOMP.sh

#!/bin/bash

qsub -q [queue] [pe] [np] -v OMP_NUM_THREADS=8 -cwd -b y \

"/opt/sun/sunstudio12.1/prod/bin/collect -p high -M CT8.1 -S on -A on -L none \

mpirun -np 4 -- /path/to/file/test"


The first one uses MPI tracing for testing MPI programs, the second one is intended for OpenMP programs (that is why it sets the OMP_NUM_THREADS variable) and the last one is for hybrid programs (they use both MPI and OpenMP). Some of the parameters used are explained in the following. You can find more information in the manual ($ man collect).

• -M CT8.1; Specify collection of an MPI experiment. CT8.1 is the MPI version installed.

• -L size; Limit the amount of profiling and tracing data recorded to size megabytes. none means no limit.

• -S interval; Collect periodic samples at the interval specified (in seconds). on defaults to 1 second.

• -A option; Control whether or not load objects used by the target process should be archived or copied into the recorded experiment. on archives load objects into the experiment.

• -p option; Collect clock-based profiling data. high turns on clock-based profiling with the default profiling interval of approximately 1 millisecond.

8.1.2 Viewing the experiment results

To view the results, open the analyzer, go to File - Open Experiment and select the experiment you want to view. A very good tutorial for analyzing the data can be found here; the Performance Analyzer MPI Tutorial is a good place to start.

The following screenshots were taken from the analysis of the programs presented in the Parallelization section under Hybrid Parallelization.

MPI only version


Hybrid (MPI + OpenMP) version

8.2 Intel MPI benchmark

The Intel MPI benchmark - IMB is a tool for evaluating the performance of an MPI installation. The idea of IMB is to provide a concise set of elementary MPI benchmark kernels. With one executable, all of the supported benchmarks, or a subset specified on the command line, can be run. The rules, such as time measurement (including a repetitive call of the kernels for better clock synchronization), message lengths, and selection of communicators to run a particular benchmark (inside the group of all started processes), are program parameters.

8.2.1 Installing and running IMB

The first step is to get the package from here. Unpack the archive. Make sure you have the Intel compiler module loaded and a working OpenMPI installation. Go to the /imb/src directory. There are three benchmarks available: IMB-MPI1, IMB-IO and IMB-EXT. You can build them separately with:

$ make <benchmark name>

or all at once with:

$ make all

Now you can run any of the three benchmarks using:

$ mpirun -np <nr_of_procs> IMB-xxx

NOTE: there are useful documents in the /imb/doc directory detailing the benchmarks.


8.2.2 Submitting a benchmark to a queue

You can also submit a benchmark to run on a queue. The following two scripts are examples:

$ cat submit.sh

#!/bin/bash

qsub -q [queue] -pe [pe] [np] -cwd -b n [script_with_the_run_cmd]

$ cat script.sh

#!/bin/bash

mpirun -np [np] IMB-xxx

To submit you just have to run:

$ ./submit.sh

After running the IMB-MPI1 benchmark on a queue with 24 processes the following result was obtained (only parts are shown):

#---------------------------------------------------

# Intel (R) MPI Benchmark Suite V3.2, MPI-1 part

#---------------------------------------------------

# Date : Thu Jul 23 16:37:23 2009

# Machine : x86_64

# System : Linux

# Release : 2.6.18-128.1.1.el5

# Version : #1 SMP Tue Feb 10 11:36:29 EST 2009

# MPI Version : 2.1

# MPI Thread Environment: MPI_THREAD_SINGLE

# Calling sequence was:

# IMB-MPI1

# Minimum message length in bytes: 0

# Maximum message length in bytes: 4194304

#

# MPI_Datatype : MPI_BYTE

# MPI_Datatype for reductions : MPI_FLOAT

# MPI_Op : MPI_SUM

#

#

# List of Benchmarks to run:

# PingPong

# PingPing

# Sendrecv

# Exchange

# Allreduce

# Reduce

# Reduce_scatter

# Allgather

# Allgatherv


# Gather

# Gatherv

# Scatter

# Scatterv

# Alltoall

# Alltoallv

# Bcast

# Barrier

[...]

#-----------------------------------------------------------------------------

# Benchmarking Exchange

# #processes = 2

# ( 22 additional processes waiting in MPI_Barrier)

#-----------------------------------------------------------------------------

#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec

524288 80 1599.30 1599.31 1599.31 1250.54

1048576 40 3743.45 3743.48 3743.46 1068.53

2097152 20 7290.26 7290.30 7290.28 1097.35

4194304 10 15406.39 15406.70 15406.55 1038.51

[...]

#-----------------------------------------------------------------------------

# Benchmarking Exchange

# #processes = 24

#-----------------------------------------------------------------------------

#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec

0 1000 75.89 76.31 76.07 0.00

1 1000 67.73 68.26 68.00 0.06

2 1000 68.47 69.29 68.90 0.11

4 1000 69.23 69.88 69.57 0.22

8 1000 68.20 68.91 68.55 0.44

262144 160 19272.77 20713.69 20165.05 48.28

524288 80 63144.46 65858.79 63997.79 30.37

1048576 40 83868.32 89965.37 87337.56 44.46

2097152 20 91448.50 106147.55 99928.08 75.37

4194304 10 121632.81 192385.91 161055.82 83.17

[...]

#----------------------------------------------------------------

# Benchmarking Alltoallv

# #processes = 8

# ( 16 additional processes waiting in MPI_Barrier)

#----------------------------------------------------------------

#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]

0 1000 0.10 0.10 0.10

1 1000 18.49 18.50 18.49

2 1000 18.50 18.52 18.51

4 1000 18.47 18.48 18.47

8 1000 18.40 18.40 18.40

16 1000 18.42 18.43 18.43

32 1000 18.89 18.90 18.89


68 1000 601.29 601.36 601.33

65536 640 1284.44 1284.71 1284.57

131072 320 3936.76 3937.16 3937.01

262144 160 10745.08 10746.09 10745.83

524288 80 22101.26 22103.33 22102.58

1048576 40 44044.33 44056.68 44052.76

2097152 20 88028.00 88041.70 88037.15

4194304 10 175437.78 175766.59 175671.63

[...]

#----------------------------------------------------------------

# Benchmarking Alltoallv

# #processes = 24

#----------------------------------------------------------------

#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]

0 1000 0.18 0.22 0.18

1 1000 891.94 892.74 892.58

2 1000 891.63 892.46 892.28

4 1000 879.25 880.09 879.94

8 1000 898.30 899.29 899.05

262144 15 923459.34 950393.47 938204.26

524288 10 1176375.79 1248858.31 1207359.81

1048576 6 1787152.85 1906829.99 1858522.38

2097152 4 3093715.25 3312132.72 3203840.16

4194304 2 5398282.53 5869063.97 5702468.73

As you can see, if you specify 24 processes the benchmark will also run the 2, 4, 8 and 16 process tests. You can fix the minimum number of processes to use with:

$ mpirun [...] <benchmark> -npmin <minimum_number_of_procs>

NOTE: Other useful control commands can be found in the Users Guide (/doc directory) under section 5.

8.3 Paraver and Extrae

8.3.1 Local deployment - Installing

PAPI 4.2.0
You should use a kernel with version >= 2.6.33. For versions <= 2.6.32 perfctr patches are required in the kernel, or you have to specify --with-perf-events to the configure script if you have support for perf events in your kernel (2.6.35.30 on 64 bits was used here). From the root folder of the svn repo:

cd papi-4.2.0/src

./configure --prefix=/usr/local/share/papi/

make

sudo make install-all

PAPI 4.2.0 can also be downloaded from here: http://icl.cs.utk.edu/papi/software/

OpenMPI 1.4.4
From the root folder of the svn repo:


cd openmpi-1.4.4/

./configure --prefix /usr/local/share/openmpi

sudo make all install

OpenMPI 1.4.4 can also be downloaded from here: http://www.open-mpi.org/software/ompi/v1.4/

Extrae 2.2.0
From the root folder of the svn repo:

cd extrae-2.2.0/

./configure --with-mpi=/usr/local/share/openmpi \

--with-papi=/usr/local/share/papi \

--enable-posix-clock --without-unwind --without-dyninst \

--prefix=/usr/local/share/extrae

make

sudo make install

Extrae 2.2.0 can also be downloaded from here: http://www.bsc.es/ssl/apps/performanceTools/

Obtaining traces
From the root folder of the svn repo:

cd acoustic_with_extrae/

Open extrae.xml. We have a few absolute paths that we need to change

in this file so that tracing will work correctly. Search for vlad.

There should be 3 occurrences in the file. Modify the paths you

find with vlad by replacing /home/vlad/dimemas_paraver_svn with

the path to your local copy of the svn repo.

make

./run_ldpreload.sh 3 (3 is the number of MPI processes)

Warning: the acoustic workload generates 950MB of output for this

run. All output files are located in the export folder. Please

make sure you have enough free space.

Extrae produces tracing files for each MPI Process. The files

are located in the trace folder. The trace/set-0 folder will

contain 3 files, one for each MPI process, which are merged

in the final line of the run_ldpreload.sh script. Each .mpits

file has between 20 and 30 MB for this run.

After the script finishes you should find 3 files like this:

EXTRAE_Paraver_trace.prv

EXTRAE_Paraver_trace.row

EXTRAE_Paraver_trace.pcf

We’re interested in the .prv file (this contains the tracing info).


8.3.2 Deployment on NCIT cluster

8.3.3 Installing

Extrae, PAPI and OpenMPI are already installed on the cluster. The PAPI version is 4.2.0 and the Extrae version is 2.2.0. Extrae is configured to work with OpenMPI version 1.5.3 for gcc 4.4.0. The module avail command will offer more information about each module.

8.3.4 Checking for Extrae installation

At the moment when this documentation was written, Extrae was not installed on all the nodes of the Opteron queue. We have created a script that checks on which nodes Extrae was installed. Copy the check_extrae_install folder on fep.grid.pub.ro and then:

./qsub.sh

This will submit a job on each of the 14 nodes of the Opteron queue which will check for the Extrae installation.

After all the jobs finish:

cat check_extrae.sh.o*

The output should be something like this:

...

opteron-wn10.grid.pub.ro 1

opteron-wn11.grid.pub.ro 0

opteron-wn12.grid.pub.ro 0

opteron-wn13.grid.pub.ro 1

...

When obtaining traces on the NCIT cluster we chose to use a single node. After Extrae is installed on all the Opteron nodes, this validation step will become unnecessary. Please don't leave any jobs stuck in the queue in the waiting state.

Obtaining traces

Copy the acoustic_with_extrae_fep folder in your home on fep.grid.pub.ro.

Change the paths in the extrae.xml file to match the paths in which you wish to collect the trace information. See section Local Deployment, subsection Obtaining traces, for more information.

. load_modules.sh

This loads the gcc, openmpi and extrae modules

Be sure to use . and not ./

make

./qsub.sh 6    # 6 in this case is the number of MPI processes

Wait for the job to finish. As you can see in the qsub.sh script, the jobs are run on a single node: opteron-wn10.grid.pub.ro. Once Extrae is installed on all the nodes in the Opteron queue, you can send the job to any node in the queue, not just a subset of nodes.

After the job finishes running:


./merge_mpits.sh

This will merge the .mpits files into a single .prv file which you can load into Paraver.

8.3.5 Visualization with Paraver

The svn repo contains a 64-bit version of Paraver. If your OS is 32-bit please download an appropriate version from here: http://www.bsc.es/ssl/apps/performanceTools/

From the root folder of the svn repo:

cd wxparaver64/bin/

export PARAVER_HOME=../

./wxparaver

File --> Load Trace

Load the previously generated .prv file

File --> Load Configuration

Load one of the configurations from intro2paraver_MPI/cfgs/. Double clicking the pretty colored output will make a window pop up with information regarding the output (what each color represents). A .doc file is located in the intro2paraver_MPI folder which explains Paraver usage with the provided configurations.

A larger number of configurations (258 possible configurations) exists in the wxparaver64/cfgs folder.

8.3.6 Do it yourself tracing on the NCIT Cluster

In order to trace a new C program you need to take the following files from the acoustic sample:

• extrae.xml

• load_modules.sh

• Makefile

• merge_mpits.sh

• qsub.sh

• run_ldpreload.sh

and copy them to your source folder. The Makefile and the run_ldpreload.sh script should be changed accordingly to match your source file hierarchy and the name of the binary. The changes that need to be made are minor (location of C source files, name of the binary, linking with extra libraries).

8.3.7 Observations

For some of the events data won't be collected because support is missing in the kernel. Patches for the perfctr syscall should be added to the kernel to collect hardware counter data. This won't be a problem on newer kernels (local testing) since with kernels >= 2.6.32 PAPI


uses the perf events infrastructure to collect hardware counter data. More information on the perfctr patches can be found here: https://ncit-cluster.grid.pub.ro/trac/HPC2011/wiki/Dimemas.

A more in-depth user guide for Extrae can be found here: http://www.bsc.es/ssl/apps/performanceTools/files/docs/extrae-userguide.pdf. It covers a number of aspects in detail. Of interest are customizing the extrae.xml file and different ways of using Extrae to obtain trace data.

Warning: The XML files we provided log all the events and will generate a lot of output. The acoustic example we provided has a fairly short run time, but for long running jobs a significant amount of data will be collected. We recommend setting up the trace files on LustreFS and not on NFS. Also, we recommend customizing the XML file so that only a subset of events will be logged. In order to limit the amount of data being logged and the number of events being handled please consult the Extrae User Guide.

Limitation: The files we provided handle tracing MPI programs and not OpenMP programs or hybrid MPI - OpenMP programs. Although the example we provided is a hybrid program, we have taken out the OpenMP part by setting the number of OpenMP threads to 1 in the input file. Future work on this project should also add scripts and makefiles for tracing OpenMP and hybrid programs.

8.4 Scalasca

Scalasca is a performance toolset that has been specifically designed to analyze parallel application execution behavior on large-scale systems. It offers an incremental performance analysis procedure that integrates runtime summaries with in-depth studies of concurrent behavior via event tracing, adopting a strategy of successively refined measurement configurations. Distinctive features are its ability to identify wait states in applications with very large numbers of processes and to combine these with efficiently summarized local measurements. Scalasca is a software tool that supports the performance optimization of parallel programs by measuring and analyzing their runtime behavior. Scalasca supports two different analysis modes that rely on either profiling or event tracing. In profiling mode, Scalasca generates aggregate performance metrics for individual function call paths, which are useful to identify the most resource-intensive parts of the program and to analyze process-local metrics such as those derived from hardware counters. In tracing mode it records individual performance-relevant events, allowing the automatic identification of call paths that exhibit wait states.

8.4.1 Installing Scalasca

Before installing Scalasca some prerequisites need to be satisfied:

• GNU make

• Qt version at least 4.2 (qmake)

• Cube3 (performance report visual explorer) - can be downloaded from the same site (www.scalasca.org)

• Fortran 77 / Fortran 95 (gfortran will suffice)

After that, you can run the following commands using root privileges:

./configure --prefix=/opt/tools/scalasca-1.3.2

make


make install

./configure --prefix=/opt/tools/cube-3.3.1

make

make install

8.4.2 Running experiments

Running on the cluster:

module load utilities/scalasca-1.4.1-gcc-4.6.3

Insert in the "Makefile" that compiles the test application the following command:

"scalasca -instrument mpicc -O3 -lm myprog.o myprog.c -o myprog".

In the "mprun.sh" script file you have to put in the MODULE section the command "compile

After that, all you have to do is to run "mprun.sh"

After that, all you have to do is to run mprun.sh. After the end of the execution data iscollected in a folder called epik [applicationName] [numProcesses].

Runtime measurement collection and analysis

The Scalasca measurement collection and analysis nexus, accessed through the scalasca -analyze command, integrates the following three steps:

• Instrumentation

• Runtime measurement and collection of data

• Analysis and interpretation

Instrumentation

First of all, in order to run profiling experiments using Scalasca, applications that use MPI or OpenMP (or both) must have their code modified before execution. This modification is done using Scalasca and it consists of inserting some specific measurement calls for important points (events) of the application's runtime.

All the necessary instrumentation of user, OpenMP and MPI functions is handled automatically by the Scalasca instrumenter, which is called using the scalasca -instrument command. All the compile and link commands for the modules of the application containing OpenMP and/or MPI code must be prefixed by the scalasca -instrument command (this also needs to be added in the Makefile of the application). An example for the command use is:

scalasca -instrument mpicc myprog.c -o myprog
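For reference, a minimal myprog.c that such a command could instrument might look like the sketch below. This listing is only our own illustration under that assumption, not part of the Scalasca distribution; every MPI call in it becomes a measured event once the binary is built through the instrumenter.

#include <mpi.h>
#include <stdio.h>

/* Minimal MPI program; each MPI call below is wrapped and measured
   once the binary is built with "scalasca -instrument mpicc". */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Barrier(MPI_COMM_WORLD);   /* a typical wait-state candidate */
    MPI_Finalize();
    return 0;
}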

Although generally more convenient, automatic function instrumentation may result in too many and/or too disruptive measurements, which can be addressed with selective instrumentation and measurement filtering. On supercomputing systems, users usually have to submit their jobs to a batch system and are not allowed to start parallel jobs directly. Therefore, the call to the scalasca command has to be provided within a batch script, which will be scheduled for execution when the required resources are available. The syntax of the batch script differs between the different scheduling systems. However, common to every batch script format is a passage where all shell commands can be placed that will be executed.


Runtime measurement and collection of data

This stage follows compilation and instrumentation and it is responsible for managing the configuration and processing of performance experiments. The tool used in this stage, referred to by its creators as the Scalasca measurement collection and analysis nexus - SCAN, is responsible for several features:

- measurement configuration - configures metrics; filters uninteresting functions, methods and subroutines; supports selective event generation.

- application execution - using the specified application launcher (e.g. mpiexec or mpirun for MPI implementations of instrumented executables)

- collection of data - stores data for later analysis in a folder named epik_<applicationName>_<numberOfProcesses>

This step is done by running the command scalasca -analyze followed by the application executable launcher (if one is needed - as is the case with MPI) together with its arguments and flags, the target executable and the target's arguments. An example for the use of this command is: scalasca -analyze mpirun -np 4 myprog

Post-processing is done the first time that an archive is examined, before launching the CUBE3 report viewer. If the scalasca -examine command is executed on an already processed experiment archive, or with a CUBE file specified as argument, the viewer is launched immediately.

An example for using this command is:

scalasca -examine epik_myprog_4x0_sum

Analysis and interpretation

The results of the previous phase are saved, as mentioned, in a folder (by default, a sub-folder of the experiment folder) named epik_<applicationName>_<numberOfProcesses>, which is the report for the previously analyzed experiment. This report needs post-processing before any results can be visualized and studied, and this process is only done the first time when it is examined, by using the command scalasca -examine.

A short textual score report can be obtained without launching the viewer: scalasca -examine -s epik_<title>. This score report comes from the cube3 score utility and provides a breakdown of the different types of region included in the measurement and their estimated associated trace buffer capacity requirements, aggregate trace size (total tbc) and largest process trace size (max tbc), which can be used to specify an appropriate ELG_BUFFER_SIZE for a subsequent trace measurement. No post-processing is performed in this case, so that only a subset of Scalasca analyses and metrics may be shown.

Using Cube3

CUBE3 is a generic user interface for presenting and browsing performance and debugging information from parallel applications. The CUBE3 main window consists of three panels containing tree displays or alternate graphical views of analysis reports. The left panel shows performance properties of the execution, the middle pane shows the call-tree or a flat profile of the application, and the right tree either shows the system hierarchy consisting of machines, compute nodes, processes, and threads or a topological view of the application's processes and threads. All tree nodes are labeled with a metric value and a colored box which can help identify hotspots. The metric value color is determined from the proportion of the total (root) value or some other specified reference value.


A click on a performance property or a call path selects the corresponding node. This has the effect that the metric value held by this node (such as execution time) will be further broken down into its constituents. That is, after selecting a performance property, the middle panel shows its distribution across the call tree. After selecting a call path (i.e., a node in the call tree), the system tree shows the distribution of the performance property in that call path across the system locations. A click on the icon left to a node in each tree expands or collapses that node. By expanding or collapsing nodes in each of the three trees, the analysis results can be viewed on different levels of granularity.

During trace collection, information about the application's execution behavior is recorded in so-called event streams. The number of events in the streams determines the size of the buffer required to hold the stream in memory. To minimize the amount of memory required, and to reduce the time to flush the event buffers to disk, only the most relevant function calls should be instrumented. When the complete event stream is larger than the memory buffer, it has to be flushed to disk during application runtime. This flush impacts application performance, as flushing is not coordinated between processes, and runtime imbalances are induced into the measurement. The Scalasca measurement system uses a default value of 10 MB per process or thread for the event trace: when this is not adequate it can be adjusted to minimize or eliminate flushing of the internal buffers. However, if too large a value is specified for the buffers, the application may be left with insufficient memory to run, or run adversely with paging to disk. Larger traces also require more disk space (at least temporarily, until analysis is complete), and are correspondingly slower to write to and read back from disk. Often it is more appropriate to reduce the size of the trace (e.g., by specifying a shorter execution, or more selective instrumentation and measurement), than to increase the buffer size.

Conclusions

Debugging a parallel application is a difficult task and tools like Scalasca are very useful whenever the behavior of running applications isn't the one we expected. Our general approach is to first observe parallel execution behavior on a coarse-grained level and then to successively refine the measurement focus as new performance knowledge becomes available.


Future enhancements will aim at both further improving the functionality and scalability of the SCALASCA toolset. Completing support for OpenMP and the missing features of MPI to eliminate the need for sequential trace analysis is a primary development objective. Using more flexible measurement control, we are striving to offer more targeted trace collection mechanisms, reducing memory and disk space requirements while retaining the value of trace-based in-depth analysis. In addition, while the current measurement and trace analysis mechanisms are already very powerful in terms of the number of application processes they support, we are working on optimized data management and workflows that will allow us to master even larger configurations. These might include truly parallel identifier unification, trace analysis without file I/O, and using parallel I/O to write analysis reports. Since parallel simulations are often iterative in nature, and individual iterations can differ in their performance characteristics, another focus of our research is therefore to study the temporal evolution of the performance behavior as a computation progresses.


9 Application Software and Program Libraries

9.1 Automatically Tuned Linear Algebra Software (ATLAS)

The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high quality linear algebra software, LAPACK for example.

The ATLAS (Automatically Tuned Linear Algebra Software) project is an ongoing research effort that provides C and Fortran77 interfaces to a portably efficient BLAS implementation, as well as a few routines from LAPACK (http://math-atlas.sourceforge.net/).

9.1.1 Using ATLAS

To initialize the environment use:

• module load blas/atlas-9.11_gcc (compiled with gcc)

• module load blas/atlas-9.11_sunstudio12.1 (compiled with sun)

To use the level 1-3 functions available in ATLAS, see the function prototypes in cblas.h and use them in your examples. When compiling you should specify the necessary library files; a minimal sketch of such an example follows the compile commands below.

Example:

• gcc example.c -lcblas -latlas

• cc example.c -lcblas -latlas
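A minimal sketch of such an example.c, calling one Level 1 and one Level 3 CBLAS routine, could look as follows. The matrix sizes and values are our own, chosen only for illustration; the file name example.c matches the compile commands above.

#include <stdio.h>
#include <cblas.h>

/* Small CBLAS example: a dot product (Level 1) and a matrix
   multiplication (Level 3) on 2x2 matrices in row-major order. */
int main(void)
{
    double x[2] = {1.0, 2.0}, y[2] = {3.0, 4.0};
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    double b[4] = {5.0, 6.0, 7.0, 8.0};
    double c[4] = {0.0, 0.0, 0.0, 0.0};

    /* Level 1: dot = x . y */
    double dot = cblas_ddot(2, x, 1, y, 1);

    /* Level 3: C = 1.0 * A * B + 0.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, a, 2, b, 2, 0.0, c, 2);

    printf("dot = %f, c[0][0] = %f\n", dot, c[0]);
    return 0;
}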

The version compiled with gcc is recommended. It is almost never a good idea to change the C compiler used to compile ATLAS's generated double precision (real and complex) code and the C compiler used to compile ATLAS's generated single precision (real and complex) code, and it is only very rarely a good idea to change the C compiler used to compile all other double precision routines and the C compiler used to compile all other single precision routines. For ATLAS 3.8.0, all architectural defaults are set using gcc 4.2 only (the one exception is MIPS/IRIX, where SGI's compiler is used). In most cases, switching these compilers will get you worse performance and accuracy, even when you are absolutely sure it is a better compiler and flag combination!


9.1.2 Performance


9.2 MKL - Intel Math Kernel Library

Intel Math Kernel Library (Intel MKL) is a library of highly optimized, extensively threaded math routines for science, engineering, and financial applications that require maximum performance. Core math functions include BLAS, LAPACK, ScaLAPACK, Sparse Solvers, Fast Fourier Transforms, Vector Math, and more. Offering performance optimizations for current and next-generation Intel processors, it includes improved integration with Microsoft Visual


Studio, Eclipse, and XCode. Intel MKL allows for full integration of the Intel Compatibility OpenMP run-time library for greater Windows/Linux cross-platform compatibility.

9.2.1 Using MKL

To initialize the environment use:

• module load blas/mkl-10.2

To compile an example that uses MKL functions with gcc, you should specify the necessary libraries; a minimal sketch of such an example follows the compile command below.

Example:

• gcc example.c -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm
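A minimal sketch of such an example.c using MKL could look as follows; here we assume the VML routine vdExp and the CBLAS interface exposed through mkl.h, and the values are our own, chosen only for illustration.

#include <stdio.h>
#include <mkl.h>

/* Small MKL example: an element-wise exponential through the VML
   interface and a dot product through the CBLAS interface. */
int main(void)
{
    double a[4] = {0.0, 1.0, 2.0, 3.0};
    double r[4];
    double x[3] = {1.0, 2.0, 3.0}, y[3] = {4.0, 5.0, 6.0};

    vdExp(4, a, r);                          /* r[i] = exp(a[i]) */
    double dot = cblas_ddot(3, x, 1, y, 1);  /* dot = x . y      */

    printf("exp(1) = %f, dot = %f\n", r[1], dot);
    return 0;
}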

9.2.2 Performance


9.3 ATLAS vs MKL - level 1,2,3 functions

The BLAS functions were tested for all 3 levels and the results are shown only for level 3. To summarize the performance tests: ATLAS loses for Level 1 BLAS, tends to beat MKL for Level 2 BLAS, and varies between quite a bit slower and quite a bit faster than MKL for Level 3 BLAS, depending on problem size and data type.

ATLAS's present Level 1 gets its optimization mainly from the compiler. This gives MKL two huge advantages: MKL can use the SSE prefetch instructions to speed up pretty much all Level 1 ops. The second advantage is in how ABS() is done. ABS() *should* be a 1-cycle operation, since you can just mask off the sign bit. However, you cannot standardly do bit operations on floats in ANSI C, so ATLAS has to use an if-type construct instead. This spells absolute doom for the performance of NRM2, ASUM and AMAX.
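To illustrate the point, the sketch below (our own illustration, not ATLAS or MKL code) contrasts a branch-free absolute value that masks off the IEEE 754 sign bit with the portable if-type construct that strict ANSI C forces on ATLAS:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Branch-free absolute value: clear the IEEE 754 sign bit.
   This needs a bit-level view of the double, which plain ANSI C
   arithmetic on floats does not provide. */
static double abs_mask(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the 64 raw bits */
    bits &= ~(UINT64_C(1) << 63);     /* clear the sign bit          */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Portable "if-type" absolute value, as a strictly ANSI C kernel has
   to write it; the branch is what hurts NRM2, ASUM and AMAX. */
static double abs_branch(double x)
{
    return (x < 0.0) ? -x : x;
}

int main(void)
{
    printf("%f %f\n", abs_mask(-2.5), abs_branch(-2.5));
    return 0;
}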

For Level 2 and 3, ATLAS has its usual advantage of leveraging basic kernels to the maximum. This means that all Level 3 ops follow the performance of GEMM, and Level 2 ops follow GER or GEMV. MKL has the usual disadvantage of optimizing all these routines separately, leading to widely varying performance.

9.4 Scilab

Scilab is a programming language associated with a rich collection of numerical algorithms covering many aspects of scientific computing problems. From the software point of view, Scilab is an interpreted language. This generally allows faster development processes, because the user directly accesses a high-level language, with a rich set of features provided by the library. The Scilab language is meant to be extended, so that user-defined data types can be defined with possibly overloaded operations. Scilab users can develop their own modules so that they can solve their particular problems. The Scilab language allows to dynamically compile and link other languages such as Fortran and C: this way, external libraries can be used as if they were a part of Scilab built-in features.

From the scientific point of view, Scilab comes with many features. At the very beginning of


Scilab, features were focused on linear algebra. But, rapidly, the number of features extended to cover many areas of scientific computing. The following is a short list of its capabilities:

• Linear algebra, sparse matrices,

• Polynomials and rational functions,

• Interpolation, approximation,

• Linear, quadratic and non linear optimization,

• Ordinary Differential Equation solver and Differential Algebraic Equations solver,

• Classic and robust control, Linear Matrix Inequality optimization, Differentiable and non-differentiable optimization,

• Signal processing,

• Statistics.

Scilab provides many graphics features, including a set of plotting functions, which allow to create 2D and 3D plots as well as user interfaces. The Xcos environment provides a hybrid dynamic systems modeler and simulator.

9.4.1 Source code and compilation

The source code for Scilab can be found at:

• via git protocol: git clone git://git.scilab.org/scilab

• via http protocol: git clone http://git.scilab.org/scilab.git

Compiling from source code. In order to compile Scilab from the source code, we issued the following commands:

module load compilers/gcc-4.6.0

module load java/jdk1.6.0_23-64bit

module load blas/atlas-9.11_gcc

export PATH=$PATH:/export/home/ncit-cluster/username/scilab-req/apache-ant-1.8.2/bin

./configure \

--without-gui \

--without-hdf5 \

--disable-build-localisation \

--with-libxml2=/export/home/ncit-cluster/username/scilab-req/libxml2-2.7.8 \

--with-pcre=/export/home/ncit-cluster/username/scilab-req/pcre-8.20 \

--with-lapack-library=/export/home/ncit-cluster/username/scilab-req/lapack-3.4.0 \

--with-umfpack-library=/export/home/ncit-cluster/username/scilab-req/UMFPACK/Lib \

--with-umfpack-include=/export/home/ncit-cluster/username/scilab-req/UMFPACK/Include


9.4.2 Using Scilab

In this section, we make our first steps with Scilab and present some simple tasks we can perform with the interpreter. There are several ways of using Scilab and the following paragraphs present three methods:

• using the console in the interactive mode

• using the exec function against a file

• using batch processing

The console. The first way is to use Scilab interactively, by typing commands in the console, analyzing the results and continuing this process until the final result is computed. This document is designed so that the Scilab examples which are printed here can be copied into the console. The goal is that the reader can experiment with Scilab behavior by himself. This is indeed a good way of understanding the behavior of the program and, most of the time, it allows a quick and smooth way of performing the desired computation. In the following example, the function disp is used in the interactive mode to print out the string "Hello World!".

-->s=" Hello World !"

s =

Hello World !

-->disp (s)

Hello World !

In the previous session, we did not type the characters "-->", which is the prompt and is managed by Scilab. We only type the statement s="Hello World!" with our keyboard and then hit the <Enter> key. Scilab's answer is s = and Hello World!. Then we type disp(s) and Scilab's answer is Hello World!.

When we edit a command, we can use the keyboard, as with a regular editor. We can use the left and right arrow keys in order to move the cursor on the line and use the <Backspace> and <Suppr> keys in order to fix errors in the text. In order to get access to previously executed commands, use the up arrow key. This allows to browse the previous commands by using the up and down arrow keys.

The <Tab> key provides a very convenient completion feature. In the following session, we type the statement disp in the console.

-->disp

The editor can be accessed from the menu of the console, under the Applications > Editor menu, or from the console, as presented in the following session.

--> editor ()

This editor allows to manage several files at the same time. There are many features which are worth mentioning in this editor. The most commonly used features are under the Execute menu.

• Load into Scilab allows to execute the statements in the current file, as if we did a copy and paste. This implies that the statements which do not end with the semicolon ";" character will produce an output in the console.


• Evaluate Selection allows to execute the statements which are currently selected.

• Execute File Into Scilab allows to execute the file, as if we used the exec function. The results which are produced in the console are only those which are associated with printing functions, such as disp for example.

We can also select a few lines in the script, right click (or Cmd+Click under Mac), and get the context menu. The Edit menu provides a very interesting feature, commonly known as a "pretty printer" in most languages. This is the Edit > Correct Indentation feature, which automatically indents the current selection. This feature is extremely convenient, as it allows to format algorithms, so that the if, for and other structured blocks are easy to analyze.

The editor provides fast access to the inline help. Indeed, assume that we have selected the disp statement, as presented in figure 7. When we right-click in the editor, we get the context menu, where the Help about "disp" entry allows to open the help page associated with the disp function.

The graphics in Scilab version 5 have been updated so that many components are now based on Java. This has a number of advantages, including the possibility to manage docking windows.

The docking system uses Flexdock, an open-source project providing a Swing docking framework. Assume that we have both the console and the editor opened in our environment. It might be annoying to manage two windows, because one may hide the other, so that we constantly have to move them around in order to actually see what happens. The Flexdock system allows to drag and drop the editor into the console, so that we finally have only one window, with several sub-windows. All Scilab windows are dockable, including the console, the editor, the help and the plotting windows.

In order to dock one window into another window, we must drag and drop the source window into the target window. To do this, we left-click on the title bar of the docking window. Before releasing the click, let us move the mouse over the target window and notice that a window, surrounded by dotted lines, is displayed. This "phantom" window indicates the location of the future docked window. We can choose this location, which can be on the top, the bottom, the left or the right of the target window. Once we have chosen the target location, we release the click, which finally moves the source window into the target window. We can also release the source window over the target window, which creates tabs.

Using exec. When several commands are to be executed, it may be more convenient to write these statements into a file with the Scilab editor. To execute the commands located in such a file, the exec function can be used, followed by the name of the script. This file generally has the extension .sce or .sci, depending on its content: files having the .sci extension contain Scilab functions and executing them loads the functions into the Scilab environment (but does not execute them), while files having the .sce extension contain both Scilab functions and executable statements. Executing a .sce file generally has an effect such as computing several variables and displaying the results in the console, creating 2D plots, reading or writing into a file, etc. Assume that the content of the file myscript.sce is the following.

disp("Hello World !")

In the Scilab console, we can use the exec function to execute the content of this script.

-->exec (" myscript .sce")

-->disp (" Hello World !")

Hello World !

In practical situations, such as debugging a complicated algorithm, the interactive mode is used most of the time with a sequence of calls to the exec and disp functions.


Batch processing. Another way of using Scilab is from the command line. Several command line options are available. Whatever the operating system is, binaries are located in the directory scilab-5.2.0/bin. Command line options must be appended to the binary for the specific platform, as described below. The -nw option allows to disable the display of the console. The -nwni option allows to launch the non-graphics mode: in this mode, the console is not displayed and plotting functions are disabled (using them will generate an error).

9.4.3 Basic elements of the language

In this section, we present the basic features of the language, that is, we show how to create a real variable, and what elementary mathematical functions can be applied to a real variable. If Scilab provided only these features, it would only be a super desktop calculator. Fortunately, it is a lot more and this is the subject of the remaining sections, where we will show how to manage other types of variables, that is booleans, complex numbers, integers and strings. It seems strange at first, but it is worth to state it right from the start: in Scilab, everything is a matrix. To be more accurate, we should write: all real, complex, boolean, integer, string and polynomial variables are matrices. Lists and other complex data structures (such as tlists and mlists) are not matrices (but can contain matrices). These complex data structures will not be presented in this document. This is why we could begin by presenting matrices. Still, we choose to present basic data types first, because Scilab matrices are in fact a special organization of these basic building blocks.

Creating real variables. In this section, we create real variables and perform simple operations with them. Scilab is an interpreted language, which implies that there is no need to declare a variable before using it. Variables are created at the moment where they are first set.

In the following example, we create and set the real variable x to 1 and perform a multiplication on this variable. In Scilab, the "=" operator means that we want to set the variable on the left hand side to the value associated with the right hand side (it is not the comparison operator, whose syntax is associated with the "==" operator).

-->x=1

x = 1.

-->x = x * 2

x = 2.

The value of the variable is displayed each time a statement is executed. That behavior can be suppressed if the line ends with the semicolon ";" character, as in the following example.

-->y=1;

-->y=y*2;

Elementary mathematical functions. In the following example, we use the cos and sin functions:

-->x = cos (2)

x =- 0.4161468

-->y = sin (2)

y = 0.9092974

-->x^2+y^2

ans = 1.


Complex Numbers. Scilab provides complex numbers, which are stored as pairs of floating point numbers. The predefined variable i represents the mathematical imaginary number i, which satisfies i^2 = -1. All elementary functions previously presented, such as sin for example, are overloaded for complex numbers. This means that, if their input argument is a complex number, the output is a complex number. Figure 17 presents functions which allow to manage complex numbers. In the following example, we set the variable x to 1 + i, and perform several basic operations on it, such as retrieving its real and imaginary parts. Notice how the single quote operator, denoted by " ' ", is used to compute the conjugate of a complex number. We finally check that the equality (1 + i)(1 - i) = 1 - i^2 = 2 is verified by Scilab.

-->x*y

ans = 2.

Strings. Strings can be stored in variables, provided that they are delimited by double quotes ("). The concatenation operation is available through the "+" operator. In the following Scilab session, we define two strings and then concatenate them with the "+" operator.

-->x = "foo"

x = foo

-->y = "bar"

y = bar

-->x+y

ans = foobar

Dynamic type of variables. When we create and manage variables, Scilab allows to change the type of a variable dynamically. This means that we can create a real value, and then put a string variable in it, as presented in the following session.

-->x=1

x =

1.

-->x+1

ans =

2.

-->x="foo"

x =

foo

-->x+"bar"

ans =

foobar

We emphasize here that Scilab is not a typed language, that is, we do not have to declare the type of a variable before setting its content. Moreover, the type of a variable can change during the life of the variable.

9.5 Deal II

9.5.1 Introduction

Deal.II is a C++ program library targeted at the computational solution of partial differential equations using adaptive finite elements. It uses state-of-the-art programming techniques to offer you a modern interface to the complex data structures and algorithms required.


9.5.2 Description

The main aim of deal.II is to enable rapid development of modern finite element codes, using among other aspects adaptive meshes and a wide array of tool classes often used in finite element programs. Writing such programs is a non-trivial task, and successful programs tend to become very large and complex. We believe that this is best done using a program library that takes care of the details of grid handling and refinement, handling of degrees of freedom, input of meshes and output of results in graphics formats, and the like. Likewise, support for several space dimensions at once is included in a way such that programs can be written independent of the space dimension without unreasonable penalties on run-time and memory consumption.

9.5.3 Installation

The first step is to get the library package from here

wget http://www.dealii.org/download/deal.II-7.1.0.tar.gz

9.5.4 Unpacking

The library comes in a tar.gz archive that we must unzip with the following commands:

gunzip deal.II-X.Y.Z.tar.gz

tar xf deal.II-X.Y.Z.tar

9.5.5 Configuration

The library has a configuration script that we must run before the installation.

./configure creates the file deal.II/common/Make.global_options,

which remembers paths and configuration options. You can call:

make -j16 target-name

to let make call multiple instances of the compiler (in this case sixteen). You can give several flags to ./configure:

• --enable-shared = saves disk space, link time and start-up time, so this is the default

• --enable-threads = the default is to use multiple threads

• --enable-mpi = if the MPI compilers exist and indeed support MPI, then this also switches on support for MPI in the library.

• --with-petsc=DIR and --with-petsc-arch=ARCH switches to ./configure can be used to override the values of PETSC_DIR and PETSC_ARCH, or can be used if these environment variables are not set at all.

• --with-metis-include, --with-metis-libs.

In order to configure the installation with the complete features of the deal.II library, we have first got to install some other required libraries:

• PETSC (Portable, Extensible Toolkit for Scientific Computation)

• ATLAS (Automatically Tuned Linear Algebra Software), used for compiling the PETSC library

• Metis (Graph Partitioning, Mesh Partitioning, Matrix Reordering), which provides various methods to partition graphs


Environment libraries. We have several OpenMPI implementations. To use the one suitable for all the libraries we must load the appropriate module.

Loading GCC

module load compilers/gcc-4.6.0

Loading MPI

module load mpi/openmpi-1.5.3_gcc-4.6.0

Installing PETSC. In order to compile PETSC we must first have a version of ATLAS that we can use. We have chosen PETSC version petsc-3.2-p5:

wget http://ftp.mcs.anl.gov/pub/petsc/release-snapshots/petsc-3.2-p5.tar.gz

gunzip petsc-3.2-p5.tar.gz

tar -xof petsc-3.2-p5.tar

cd petsc-3.2-p5

First we need to get the ATLAS library package and unpack it:

wget atlas3.9.51.tar.bz2

bunzip2 atlas3.9.51.tar.bz2

tar -xof atlas3.9.51.tar

mv ATLAS ATLAS3.9.51

cd ATLAS3.9.51

Now we need to configure the ATLAS library for the current machine, by creating the working directories for the installation:

mkdir Linux_C2D64SSE3

cd Linux_C2D64SSE3

mkdir /export/home/ncit-cluster/stud/g/george.neagoe/hpc/lib/atlas

Running the configuration script of ATLAS library:

../configure -b 64 -D c -DPentiumCPS=2400

--prefix=/export/home/ncit-cluster/stud/g/george.neagoe/hpc/lib/atlas

-Fa alg -fPIC

We must use the flags -Fa alg -fPIC because of an error that we encountered when compiling the PETSC library:

liblapack.a(dgeqrf.o):

relocation R_X86_64_32 against `a local symbol'

can not be used when making a shared object;

recompile with -fPIC

The script should produce output like the following:

cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/..//makes/Make.l3thr src/threads/blas/level3/Makefile

cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/..//makes/Make.l2thr src/threads/blas/level2/Makefile

cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/..//makes/Make.l3ptblas src/pthreads/blas/level3/Makefile

cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/..//makes/Make.dummy src/pthreads/blas/level2/Makefile

cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/..//makes/Make.dummy src/pthreads/blas/level1/Makefile

cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/..//makes/Make.miptblas src/pthreads/misc/Makefile

cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/..//makes/Make.pkl3 src/blas/pklevel3/Makefile

cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/..//makes/Make.gpmm src/blas/pklevel3/gpmm/Makefile

cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/..//makes/Make.sprk src/blas/pklevel3/sprk/Makefile

cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/..//makes/Make.l3 src/blas/level3/Makefile

cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/..//makes/Make.l3aux src/blas/level3/rblas/Makefile

cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/..//makes/Make.l3kern src/blas/level3/kernel/Makefile

cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/..//makes/atlas_trsmNB.h include/.

cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/..//CONFIG/ARCHS/Makefile ARCHS/.

make[2]: warning: Clock skew detected. Your build may be incomplete.

make[2]: Leaving directory `/export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3'

make[1]: warning: Clock skew detected. Your build may be incomplete.

make[1]: Leaving directory `/export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3'

make: warning: Clock skew detected. Your build may be incomplete.

DONE configure

Building the ATLAS library after the configuration:

make build

make check

make time

make install

Configuring PETSC:

cd petsc-3.2-p5/

./configure

--with-blas-lapack-dir=

/export/home/ncit-cluster/stud/g/george.neagoe/hpc/lib/atlas/lib

--with-mpi-dir=/opt/libs/openmpi/openmpi-1.5.3_gcc-4.6.0

--with-debugging=1

--with-shared-libraries=1

The configuration script will have an output like the following one:

xxx=========================================================================xxx

Configure stage complete. Now build PETSc libraries with (legacy build):

make PETSC_DIR=/export/home/ncit-cluster/stud/g/george.neagoe/hpc/petsc-3.2-p5 PETSC_ARCH=arch-linux2-c-debug all

or (experimental with python):

PETSC_DIR=/export/home/ncit-cluster/stud/g/george.neagoe/hpc/petsc-3.2-p5 PETSC_ARCH=arch-linux2-c-debug ./config/builder.py

xxx=========================================================================xxx

Then we can build the PETSC library:

make all


Output:

Completed building libraries

=========================================

making shared libraries in

/export/home/ncit-cluster/stud/g/george.neagoe/

hpc/petsc-3.2-p5/arch-linux2-c-debug/lib

building libpetsc.so

In order to compile Deal II with the freshly compiled PETSC library we must set some environment variables, and also the LD_LIBRARY_PATH.

PETSC configuration parameters: PETSC_DIR: this variable should point to the location of the PETSc installation that is used. Multiple PETSc versions can coexist on the same file-system. By changing the PETSC_DIR value, one can switch between these installed versions of PETSc.

PETSC_ARCH: this variable gives a name to a configuration/build. Configure uses this value to store the generated config makefiles in ${PETSC_DIR}/${PETSC_ARCH}/conf, and make uses this value to determine the location of these makefiles (which in turn help in locating the correct include and library files).

Thus one can install multiple variants of PETSc libraries - by providing different PETSC_ARCH values to each configure build. Then one can switch between using these variants of libraries by switching the PETSC_ARCH value used.

If configure doesn't find a PETSC_ARCH value (either in an environment variable or a command line option), it automatically generates a default value and uses it. Also, if make doesn't find a PETSC_ARCH environment variable, it defaults to the value used by the last successful invocation of configure.

We must define the parameters in bash with the following commands before running the configuration script:

PETSC_ARCH=arch-linux2-c-debug; export PETSC_ARCH

PETSC_DIR=/export/home/ncit-cluster/stud/g/george.neagoe/hpc/petsc-3.2-p5; export PETSC_DIR

export LD_LIBRARY_PATH=/export/home/ncit-cluster/stud/g/george.neagoe/hpc/petsc-3.2-p5/arch-linux2-c-debug/lib:${LD_LIBRARY_PATH}

echo $LD_LIBRARY_PATH

After the PETSC library was complete, we encountered another problem when we tried to run the parallelized examples from the deal.II library:

Exception on processing:

--------------------------------------------------------

An error occurred in line <98> of file

</export/home/ncit-cluster/stud/g/george.neagoe/hpc/deal.II_v2/source/lac/sparsity_tools.cc>

in function void dealii::SparsityTools::partition

(const dealii::SparsityPattern&, unsigned int, std::vector<unsigned int>&)

The violated condition was: false

The name and call sequence of the exception was:

ExcMETISNotInstalled()

Additional Information: (none)

So, as we can notice, the problem was that the METIS library was not installed.


Installing METIS. In order to generate partitionings of triangulations, we have functions that call the METIS library. METIS is a library that provides various methods to partition graphs, which we use to define which cell belongs to which part of a triangulation. The main point in using METIS is to generate partitions so that the interfaces between cell blocks are as small as possible. This data can, in turn, be used to distribute degrees of freedom onto different processors when using PETSc and/or SLEPc in parallel mode.

As with PETSc and SLEPc, the use of METIS is optional. If you wish to use it, you can do so by having a METIS installation around at the time of calling ./configure, by either setting the METIS_DIR environment variable denoting the path to the METIS library, or using the --with-metis flag. If METIS was installed as part of /usr or /opt, instead of local directories in a home directory for example, you can use the configure switches --with-metis-include, --with-metis-libs.

On some systems, when using shared libraries for deal.II, you may get warnings of the kind "libmetis.a(pmetis.o): relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC" when linking. This can be avoided by recompiling METIS with -fPIC as a compiler flag.

METIS is not needed when using p4est to parallelize programs, see below.

9.5.6 Running Examples

The programs are in the examples/ directory of your local deal.II installation. After compiling the library itself, if you go into one of the tutorial directories, you can compile the program by typing make, and run it using make run. The latter command also compiles the program if that has not already been done.

Example 1

cd /deal.II/examples/step-1

ls -l

total 1324

drwxr-xr-x 2 george.neagoe studcs 4096 Oct 9 22:42 doc

-rw-r--r-- 1 george.neagoe studcs 5615 Sep 21 18:12 Makefile

-rw-r--r-- 1 george.neagoe studcs 168998 Nov 8 03:44 Makefile.dep

-rwxr-xr-x 1 george.neagoe studcs 469880 Nov 8 03:44 step-1

-rw-r--r-- 1 george.neagoe studcs 18200 May 17 07:34 step-1.cc

-rw-r--r-- 1 george.neagoe studcs 666928 Nov 8 03:44 step-1.g.o

./step-1

ls -l

-rw-r--r-- 1 george.neagoe studcs 29469 Nov 8 04:05 grid-1.eps

-rw-r--r-- 1 george.neagoe studcs 129457 Nov 8 04:05 grid-2.eps

Example 2

cd /deal.II/examples/step-2

./step-2

ls -l

-rw-r--r-- 1 george.neagoe studcs 91942 Nov 8 04:16 sparsity_pattern.1

-rw-r--r-- 1 george.neagoe studcs 92316 Nov 8 04:16 sparsity_pattern.2

For viewing the 2D results we need to use gnuplot

gnuplot

Terminal type set to ’x11’

gnuplot> set style data points


Figure 1: grid-1.eps

Figure 2: grid-2.eps


Figure 3: sparsity_pattern.1, gnuplot

Figure 4: sparsity_pattern.2, gnuplot


Example 3

cd /deal.II/examples/step-3

./step-3

ls -l

-rw-r--r-- 1 george.neagoe studcs 96288 Nov 8 04:30 solution.gpl

gnuplot

gnuplot> set style data lines

gnuplot> splot "solution.gpl"

Or with Hidden 3D

gnuplot> set hidden3d

gnuplot> splot "solution.gpl"

Example 17

cd step-17/

PETSC_ARCH=arch-linux2-c-debug; export PETSC_ARCH

PETSC_DIR=/export/home/ncit-cluster/stud/g/george.neagoe/hpc/petsc-3.2-p5; export PETSC_DIR

export LD_LIBRARY_PATH=/export/home/ncit-cluster/stud/g/george.neagoe/hpc/petsc-3.2-p5/arch-linux2-c-debug/lib:${LD_LIBRARY_PATH}

echo $LD_LIBRARY_PATH

mpirun -np 8 ./step-17

ls -l

-rw-r--r-- 1 george.neagoe studcs 9333 Jan 10 00:24 solution-0.gmv

-rw-r--r-- 1 george.neagoe studcs 19306 Jan 10 00:24 solution-1.gmv

-rw-r--r-- 1 george.neagoe studcs 38936 Jan 10 00:24 solution-2.gmv

-rw-r--r-- 1 george.neagoe studcs 77726 Jan 10 00:24 solution-3.gmv

-rw-r--r-- 1 george.neagoe studcs 153871 Jan 10 00:24 solution-4.gmv

-rw-r--r-- 1 george.neagoe studcs 299514 Jan 10 00:24 solution-5.gmv

-rw-r--r-- 1 george.neagoe studcs 588328 Jan 10 00:24 solution-6.gmv

-rw-r--r-- 1 george.neagoe studcs 1146705 Jan 10 00:24 solution-7.gmv

-rw-r--r-- 1 george.neagoe studcs 2228944 Jan 10 00:24 solution-8.gmv

-rw-r--r-- 1 george.neagoe studcs 4278926 Jan 10 00:24 solution-9.gmv

The results are mesh *.gmv (general mesh view) files. Some of the result files of other programs' output (*.vtk) can be viewed using the paraview module, only on fep.grid.pub.ro.

module load tools/ParaView-3.8.1

paraview

After this we can view the file using the paraview menu. For starting the application, we need to make sure that we have connected to the cluster with X11 port forwarding. Example:

ssh -X username@fep.grid.pub.ro


Figure 5: solution.gpl, gnuplot, 3D normal

Figure 6: solution.gpl, gnuplot, hidden3d

