NMCAC Knowledge Transfer
06/16/2008
Marti Baldwin, Manager, SGI Solutions Lab
Overall Top-level Diagram
(Diagram: Encanto Altix ICE system – 14,336 compute cores, 1,792 nodes, 36 racks.
UNM Altix ICE Exemplar – 176 compute cores, 22 nodes, 1 rack.
NMSU Altix ICE Exemplar – 176 compute cores, 22 nodes, 1 rack.
New Mexico Tech Altix ICE Exemplar – 176 compute cores, 22 nodes, 1 rack.
Sites connect over 10 Gig-E through the UNM, NMSU, and NM Tech gateways and micro gateways to the State of New Mexico network and the National Lambda Rail.)
Major Components
• 2.1 TFLOPS Altix ICE 8200 Cluster
• 22 Nodes, 44 Sockets, 176 Cores
• 352 GB memory (2 GB/core)
• 16 TB NAS Storage Subsystem
• 1 System Admin Controller
• 1 Rack Leader Controller
• 1 Login Service Node
• Lambda Rail Connectivity
• 1 Consolidated Rack
Exemplar System Overview
Exemplar System Rack
Rack Leader Controller – 6015B
Altix ICE 8200 Blade Enclosure (11 blades)
Administrator Console
Altix 450
NAS Storage Cube: IS4000 Controller and 16 Drives
Altix ICE 8200 Blade Enclosure (11 blades)
Sys Admin Controller – 6015B
Login Node – 6015B
Drive Tray (16 Drives)
Altix ICE Cluster
Altix ICE 8200 Compute Blade
(Block diagram: two 3.0 GHz quad-core Xeons on a Greencreek MCH with a 1066/1333 MTS front-side bus and four FBD 533/667 memory channels; ESB-2E SIO on DMI x4 with serial interface and flash; two x4 DDR InfiniBand HCAs on PCIe x8 links (4 GB/s each); dual GbE; BMC; one PCIe slot unused.)
Altix ICE 8200 Blade Enclosure Interconnect Diagram
(Diagram: 16 compute blade slots, Compute 00 through Compute 15, and two DDR InfiniBand switch blades in the enclosure.)
Exemplar Top-level Diagram
(Diagram: Lambda Rail gateway on one Lambda Rail fiber and 10 Gig-E; Login Service Node and System Admin Controller on Gig-E; Altix ICE 8200 cluster – 176 compute cores, 22 nodes, 1 rack – with 33 DDR IB lines; Altix 450 NAS head node serving the NAS storage subsystem, 16 TB raw.)
Exemplar NAS Storage Subsystem
(Diagram: four IB links from the cluster to the Altix 450 NAS head; IS4000 RAID controller with FC connections to two SATA RAID drive trays.)
Tempo Administrative Commands
cpower control power for cluster node(s)
cimage manage compute blade images
cexec run commands on cluster node(s)
console open system console to cluster node
Tempo Administrative Commands (cont.)
cadmin set/show certain cluster parameters
ckill kill processes on cluster node(s)
cget get file from cluster node(s)
cpush push file to cluster node(s)
clist list names of cluster node(s)
cnum return node number based on name
cname return node position based on name
cpower
cpower examples
cpower r1i*
cpower --off r2i*
cpower --iru --up r1*
cpower --rack --up r4
cpower --rack --noleader --off r4
cpower --halt --system
cpower --off --system
Power up the cluster
Exemplar power up
boot admin through bmc or power button
telnet service1-l2          <-- system controller for NAS
L2> pwr u
ctrl-\  terminate telnet
ctrl-t  go to L2> prompt from console
cpower --reboot --system
use smadmin to start opensm on both fabrics once the cluster is booted
Power down the cluster
Exemplar power down
cpower --halt --system
ssh service1 halt
telnet service1-l2
L2> pwr d
ctrl-\  terminate telnet
ctrl-t  go to L2> prompt from console
halt admin
Creating images
cp /etc/opt/sgi/rpmlists/service-sles10sp1.rpmlist /etc/opt/sgi/rpmlists/new-service-node.rpmlist
mksiimage -A --name new-service-node-image --location /tftpboot/distro/sles-10-x86_64, /tftpboot/oscar/common-rpms,/tftpboot/oscar/sles-10-x86_64 --filename /etc/opt/sgi/rpmlists/new-service-sles10sp1.rpmlist
cimage --add-db new-compute-node-image      <-- if this is a compute node image
post-process-sgi-image /var/lib/systemimager/images/new-service-node-image/ eth1
Adding RPMs to images
cd /var/lib/systemimager/images/
cp /newrpm.rpm new-compute-sles10sp1/tmp
chroot new-compute-sles10sp1 bash
rpm -Uvh /tmp/newrpm.rpm
rm /tmp/newrpm.rpm      <-- clean up /tmp dir
exit                    <-- get out of chroot
Adding RPMs to images (cont.)
cimage --push-rack new-compute-sles10sp1 r\*
cimage --set new-compute-sles10sp1 2.6.16.54-0.2.5-smp "r*i*n*"
cimage --del-db new-compute-sles10sp1
cimage --add-db new-compute-sles10sp1
cimage --list-nodes r1
cimage --del-image mynewimage
cimage command
Show images and compute node boot kernels
cimage --list-images
cimage --list-nodes
Push modified images to a rack
cimage --push-rack new-compute-sles10sp1 r\*
Set compute node kernel to boot
cimage --set new-compute-sles10sp1 kernel_name r1i0n0
Clone a compute image
cimage --clone-image compute-sles10sp1 new-compute-sles10sp1
cimage command (cont.)
Adding/updating kernels in compute images
cimage --del-db new-compute-sles10sp1
cimage --add-db new-compute-sles10sp1
Removing images from Tempo
cimage --del-image new-compute-sles10sp1
cexec
cexec --all hostname
cexec --head hostname
cexec -f /etc/c3svc.conf hostname
cexec rack_1:16 hostname
cexec blades:16 hostname
count booted nodes
cexec -p pwd | grep -c root
console and ipmitool
console runs via a conserver daemon on the leader node, which uses ipmitool to open consoles on nodes
console logs: admin:/net/r1lead/var/log/consoles/
console service0
console r1i0n0
“ctrl-e,c,.” to get out of console
console and ipmitool (cont.)
ipmitool is used to communicate with the BMCs of cluster nodes
monitors power and temperature
connected to aux power, so it powers up when the node is plugged into AC
Setting the IP address of the admin node's BMC
need to start the ipmi daemon to communicate with the local node's BMC
/etc/init.d/ipmi start
ipmitool lan print 1
ipmitool lan set 1 ipaddr x.x.x.x
ipmitool lan set 1 netmask x.x.x.x
ipmitool lan set 1 defgw x.x.x.x
/etc/init.d/ipmi stop
console and ipmitool (cont.)
Accessing a remote BMC
ipmitool -I lanplus -o supermicro -H service0-bmc -U ADMIN -P ADMIN power status
cadmin
power down a node for maintenance
cpower --down r1i0n0
cadmin --set-admin-status --node r1i0n0 offline
power up a node after maintenance
cadmin --set-admin-status --node r1i0n0 online
cpower --boot r1i0n0
set boot order of service nodes
cadmin --set-boot-order --node service0 2
0 = skip node in cpower
InfiniBand commands
ibstat       show current HCA stats
ibstatus     show link rate and state of HCA
State: Initializing    opensm not running
State: Active          opensm running
perfquery    find IB errors and reset counters
perfquery -P 1 -R -a
perfquery -P 2 -R -a
ibdiagnet    scans the fabric and prints any errors
ibcheckwidth, ibcheckerrors
ibnetdiscover    discovers the IB network
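A quick fabric health pass can chain the commands above across the cluster; a minimal sketch, assuming ibstatus and perfquery are installed on every node and using the cexec syntax shown earlier:
# link state and rate on every node; anything not ACTIVE needs a look
cexec --all "ibstatus | grep -E 'state|rate'"
# reset and re-read error counters on both fabrics, then rescan for errors
perfquery -P 1 -R -a
perfquery -P 2 -R -a
ibdiagnet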
opensm
Configure opensm. This is typically only run once, usually during the original Tempo install.
smconfig -f ib0
smconfig -f ib1
Start opensm on each fabric. This must be done manually after a cluster power up.
smadmin -f ib0 -u
smadmin -f ib1 -u
opensm (cont.)
Get the status of opensm on a fabric
smadmin -f ib0 -s
Restart opensm on a fabric
smadmin -f ib0 -r
ivt – Inventory Verification Tool
ivt -M take snapshot
ivt -L list snapshots
ivt -S analyze snapshots
ivt -Q create custom ivt queries
Monitoring
SEL
Critical hardware events are gathered for the leader nodes, compute blades, and CMCs and logged in the following locations:
- /var/log/messages
- /var/log/sel/sel.log
- Embedded Support Partner (ESP)
Ganglia
http://admin/ganglia
Node availability (60 sec heartbeat)
gmetad    daemon that runs on the admin node
gmond     daemon that runs on leader, service, and compute nodes
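To confirm the Ganglia daemons are actually running after a power-up, a minimal check (a sketch, assuming the stock gmetad/gmond init scripts and the cexec groups shown earlier):
/etc/init.d/gmetad status            # gmetad on the admin node
cexec --all "pgrep -l gmond"         # gmond on leader, service, and compute nodes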
Monitoring (cont.)
ESP – Embedded Support Partner
http://admin:5554
Must manually register nodes and complete the customer profile in the web page
username: administrator
password: partner
ESP Administration > Customer Profile
Fill out the required fields
Click Add, Click Commit
ESP User Guide, man pages
Monitoring (cont.)
PCP – Performance Co-Pilot
Collects SDR information
  sensor information
  cluster statistics
pmgcluster, clustervis, pmice
pmchart -h r1i0n0    create or view charts
PCP User Guide, man pages
Software Repositories and Updates
Repository locations on the admin node
/tftpboot/distro/sles-10-x86_64     SLES10
/tftpboot/oscar/common-rpms         SGI Tempo
/tftpboot/oscar/sles-10-x86_64      SGI ProPack
sync-repo-updates
Configure to use with SLES
Register the admin node with Novell
Configure for use with Tempo and ProPack
Create an SGI Supportfolio account
Configure yup
Software Repositories and Updates (cont.)
Use yum to update cluster nodes
Nodes with disks
ssh service0 yum -y update
ssh r1lead yum -y update
yum -y update    (admin node)
Images
yum-image-wrapper /var/lib/systemimager/images/compute-clone update
Installing a package with yum
ssh service0 yum install zlib-devel
Other Tempo related commands
tempo-info-gather collects information about the cluster. It is sometimes requested by SGI support and creates a large output file.
firmware_revs displays firmware revisions for the BIOS, BMCs, CMCs, and IB in the cluster
dbdump dumps Tempo database
Backup/Restore Tempo database
mysqldump --opt oscar > backup-database.sql
mysql oscar < backup-database.sql
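Since the cluster configuration lives in the oscar database, the dump above is easy to automate; a sketch, assuming root's crontab on the admin node and a hypothetical /root/tempo-backups directory:
# weekly dump at 02:00 Sunday, kept with a date stamp (% must be escaped in crontab)
0 2 * * 0  mysqldump --opt oscar > /root/tempo-backups/oscar-$(date +\%Y\%m\%d).sql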
Diagnostics
SGI Diagnostics loaded in /usr/diags/bin
Memory
cexec --all /usr/diags/bin/olcmt PERCENT=98 REPEAT=5
ipmitool -v -I lanplus -o supermicro -H <IPaddr> -U ADMIN -P ADMIN sel list | grep ECC
Stress
/usr/diags/bin/pandora percent=80 -runtime 15
NAS
NAS web interface
https://service1:1178
The NAS is not managed by Tempo
Needs to be power cycled manually
ssh service1 halt
telnet service1-l2
L2> pwr
L2> pwr d
L2> pwr u
ctrl-d  go to system console
ctrl-t  go to L2 prompt (type L2 to stay at the prompt)
Altix 450 User Guide, AppMan User Guide
Intel Compilers
Located in /apps/intel
License manager runs on service0
/etc/init.d/Intel.lmgrd    startup script
ps -ef | grep Intel        check if the license server is running
modulefiles are loaded for the compilers
module avail
module load
module unload
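As a quick sanity check of the compiler environment, a minimal sketch (module names are taken from the PBS examples later in this deck; hello.c is a hypothetical test source):
source /usr/share/modules/init/bash
module load cc/9.1.052 fc/9.1.052
icc -o hello hello.c        # a C compile exercises the license server on service0
ifort -V                    # prints the Fortran compiler version if licensing is healthy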
Launching MPI jobs on Encanto
• SGI MPT % mpirun -v -f mpt.hosts.$$ -np 8 ./hellohostf_mpt
Example: /home/examples/mpi/smpi/run_smpi
• MVAPICH MPI % mpirun -machinefile mpd.hosts.$$ -np 16 ./hellohostf_mv
Example: /home/examples/mpi/mvmpi/run_mvmpi
• Intel MPI % mpdboot -n 2 -f mpd.hosts.$$ -r ssh
% mpiexec -machinefile mpd.hosts.$$ -np 16 ./hellohostf_impi
Example: /home/examples/mpi/impi/run_impi
• Open MPI % mpiexec -machinefile mpd.hosts.$$ \
    -byslot -mca btl_openib_warn_default_gid_prefix 0 \
    -mca oob_tcp_peer_retries 300 \
    -mca btl openib,sm,self \
    -np 16 ./hellohostf_ompi
Example: /home/examples/mpi/ompi/run_ompi
Documentation
SGI Technical Publications: http://techpubs.sgi.com
SGI Supportfolio: http://www.sgi.com/support
Support for Exemplar Systems – Full Care
SGI Full Care Support
• FullCare support provides hardware and software support for Encanto with next-business-day priority response.
• Software support includes the ability to call the support center and ask questions, file bugs, etc. on covered software.
• FullCare support extends the manufacturer's warranty, which covers only the hardware and includes limited technical assistance from the call center. The warranty covers only failed hardware, not software.
• Hardware is defined as all physical components, including the disks, node boards, login and admin servers, network interface cards, and internal cabling.
• Next-business-day response is 8x5: eight hours a day, five days a week during normal business hours, excluding holidays.
Support for Exemplar Systems – Call Logging
• To log a trouble ticket:
– Call 1-800-800-4SGI, or go to support.sgi.com and use Supportfolio.
• Reference the following serial numbers:
– UNM Exemplar System Serial Number Z0000068
– NMSU Exemplar System Serial Number Z0000070
– NMTech Exemplar System Serial Number Z0000071
Support for Exemplar Systems – Call Process
SGI Customer Education for Altix ICE
Introduction to the Linux® Operating System
This course introduces students to the basic command-line tools that the Linux operating system provides.
Linux® System Administration
This course introduces Linux command-line system administration to users who have completed a basic Linux, UNIX, or IRIX class.
Linux® Network Administration
This course provides experienced system administrators with the necessary skills to configure, manage, and troubleshoot the SGI Linux™ open-source operating system in a TCP/IP networked environment.
SGI Customer Education for Altix ICE (cont.)
SGI® Altix® System Administration I (SLES based)
This course provides the experienced Linux® user with the skills and information needed to administer the SGI Altix 3000/4000 family of servers and superclusters.
SGI® Altix® System Administration II (SLES based)
This course provides the experienced Linux® user with the skills and information needed to administer the SGI Altix 3000 family of servers and superclusters.
SGI® Altix® ICE Cluster Administration
The Altix ICE Cluster Administration course provides knowledge and practice in basic cluster administration areas such as IPMI configuration, SGI Tempo cluster software installation and configuration, Torque configuration and job submittal, InfiniBand configuration, and cluster monitoring and troubleshooting using Ganglia and Performance Co-Pilot tools.
• Note classes are 4.5 days long and offered in multiple US locations
Questions?
PBS Pro Job Submission
NMCAC – Encanto 14,336 Core Cluster
Scott Shaw
[email protected]
Overview
• Intro to PBS Shell Scripting
• How to Submit a PBS Job
• Tuning Job Placement Using Resource Definitions
• PBS Commands to Know
### PBS example script
#!/bin/bash
#PBS -l select=16:ncpus=8:mpiprocs=8
#PBS -l walltime=01:00:00
#PBS -N examplejob
#PBS -j oe
#PBS -V

cd $PBS_O_WORKDIR

cat $PBS_NODEFILE > mpd.hosts

source /usr/share/modules/init/bash
module purge
module load cc/9.1.052 fc/9.1.052 openmpi_intel
module list

mpirun -machinefile ./mpd.hosts -np 128 ./hello_world
Intro to PBS Shell Scripting
A PBS shell script can be basic or complex depending on the requirements of the job type. The steps below map directly onto the example script above (see the annotated sketch after this list).
Step 1: Specify a shell type
Step 2: Specify the PBS resource definitions
Step 3: Change to the working directory
Step 4: Load the compiler and MPI implementation
Step 5: Execute the application
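A minimal sketch of those five steps, using only directives and module names from the example script above (hello_world stands in for any MPI application):
#!/bin/bash                                    # Step 1: shell type
#PBS -l select=16:ncpus=8:mpiprocs=8           # Step 2: PBS resource definitions
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR                              # Step 3: change to the working directory
cat $PBS_NODEFILE > mpd.hosts
source /usr/share/modules/init/bash            # Step 4: load compiler and MPI
module load cc/9.1.052 fc/9.1.052 openmpi_intel
mpirun -machinefile ./mpd.hosts -np 128 ./hello_world   # Step 5: execute the application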
To run a 128-core job with 8 MPI processes per blade across 16 blades:
• #PBS -l select=16:ncpus=8:mpiprocs=8
To run a 64-core job with 4 MPI processes per blade across 16 blades:
• #PBS -l select=16:ncpus=8:mpiprocs=4
You *must* specify a PBS walltime; the default is set to 1 minute.
How to Submit a PBS Job
• PBS Pro supports two modes for job submission
– Batch mode
qsub examplejob
or
qsub -l walltime=2:30:00 examplejob
– Interactive (shell) mode, for debugging scripts or applications
qsub -I examplejob
(The walltime=2:30:00 override increases the runtime by 1.5 hours over the script's one-hour default.)
service0:~> qsub -I examplejob
qsub: waiting for job 129301.service0 to start
qsub: job 129301.service0 ready
Start Prologue v2.1 Fri Jun 13 07:34:07 MDT 2008
End Prologue v2.1 Fri Jun 13 07:34:07 MDT 2008
scott_shaw@r14i0n6:~>
Tuning Job Placement - Using PBS Resource Definitions
Important concepts
– Compute blades/nodes: 2 quad-core processors, 16 GB memory, diskless blades
– IRU: houses 16 blades, and each rack has four IRUs
– RACK: houses four IRUs, 64 compute blades, or 512 processor cores
(Diagram: rack layout showing blades, IRUs, and the full rack, with L1 displays.)
### PBS example script showing job placement
#!/bin/bash
#PBS -l select=16:ncpus=8:mpiprocs=8
#PBS -l place=scatter:excl:group=iru
#PBS -l walltime=01:00:00
#PBS -N examplejob
#PBS -j oe
#PBS -V

cd $PBS_O_WORKDIR

rm -f mpd.hosts
cat $PBS_NODEFILE > mpd.hosts

source /usr/share/modules/init/bash
module purge
module load cc/9.1.052 fc/9.1.052 openmpi_intel
module list

mpirun -machinefile ./mpd.hosts -np 128 ./hello_world
Resource types:
IRU – isolate the PBS job within the least number of IRUs
RACK – isolate the PBS job within the least number of racks
Default – without specifying a group= statement, any free blade/IRU/rack will be used
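For reference, the corresponding directives look like this; only the group=iru form appears verbatim in the script above, and the rack form is a sketch following the same PBS Pro place syntax:
#PBS -l place=scatter:excl:group=iru     # confine the job to the fewest IRUs
#PBS -l place=scatter:excl:group=rack    # confine the job to the fewest racks
#PBS -l place=scatter:excl               # default: any free blades are used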
PBS Commands to know
• User Commands:
qsub – the command to submit a PBS queue script
qhold – place a currently queued (Q) or running (R) job on hold (H)
qalter – allow changing of submission parameters
qrls – release a job from the H state
qdel – delete a job from the queue; command-line option "-W force {jobid}"
qrerun – rerun a previously submitted job (must know the jobid)
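A few usage sketches for these commands, using the job ID 129301 from the interactive example earlier (any real job ID from qstat works):
qhold 129301                          # hold a queued or running job (state H)
qalter -l walltime=02:30:00 129301    # change a submission parameter on a queued job
qrls 129301                           # release the hold
qdel 129301                           # delete the job from the queue
qdel -W force 129301                  # force-delete a job that will not exit cleanly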
• User Monitoring Commands:
qstat – the command used to check the PBS queue
tracejob – command used to review details of current/previous jobs
pbsnodes – command used to review offline blades/nodes (-l)
PBS Commands to know
Monitor the PBS queue in a top-like output
watch -n 5 "qstat -a"
Check the PBS job(s) status in the queue
qstat -s {jobid}
Check PBS job(s) to see which blades/nodes are being used
qstat -n {jobid}
Output the job(s) details
qstat -f {jobid}
Output the free blades/nodes not in use
pbsnodes -a | grep -B 3 "state = free" | grep Mom | awk '{print $3}'
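If only a count of free blades is needed, a shorter sketch against the same pbsnodes output:
pbsnodes -a | grep -c "state = free"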
Questions?