NMCAC Knowledge Transfer
06/16/2008
Marti Baldwin, Manager, SGI Solutions Lab
Overall Top-level Diagram
(Diagram: Encanto Altix ICE system – 14,336 compute cores, 1,792 nodes, 36 racks.
UNM Altix ICE Exemplar – 176 compute cores, 22 nodes, 1 rack.
NMSU Altix ICE Exemplar – 176 compute cores, 22 nodes, 1 rack.
New Mexico Tech Altix ICE Exemplar – 176 compute cores, 22 nodes, 1 rack.
Sites connect over 10 Gig-E through the UNM, NMSU, and NM Tech gateways and micro gateways to the State of New Mexico network and the National Lambda Rail.)
Major Components
• 2.1 TFLOPS Altix ICE 8200 Cluster
• 22 Nodes, 44 Sockets, 176 Cores
• 352 GB memory (2 GB/core)
• 16 TB NAS Storage Subsystem
• 1 System Admin Controller
• 1 Rack Leader Controller
• 1 Login Service Node
• Lambda Rail Connectivity
• 1 Consolidated Rack
Exemplar System Overview
Exemplar System Rack
Rack Leader Controller – 6015B
Altix ICE 8200 Blade Enclosure (11 blades)
Administrator Console
Altix 450
NAS Storage Cube: IS4000 Controller and 16 Drives
Altix ICE 8200 Blade Enclosure (11 blades)
Sys Admin Controller – 6015B
Login Node – 6015B
Drive Tray (16 Drives)
Altix ICE Cluster
Altix ICE 8200 Compute Blade
(Block diagram: two 3.0 GHz quad-core Xeons on a Greencreek MCH with a 1066/1333 MTS front-side bus and four FBD 533/667 memory channels; ESB-2E SIO on DMI x4 with serial interface and flash; two x4 DDR InfiniBand HCAs on PCIe x8 links (4 GB/s each); dual GbE; BMC; one PCIe slot unused.)
Altix ICE 8200 Blade Enclosure Interconnect Diagram
(Diagram: 16 compute blade slots, Compute 00 through Compute 15, and two DDR InfiniBand switch blades in the enclosure.)
Exemplar Top-level Diagram
(Diagram: Lambda Rail gateway on one Lambda Rail fiber and 10 Gig-E; Login Service Node and System Admin Controller on Gig-E; Altix ICE 8200 cluster – 176 compute cores, 22 nodes, 1 rack – with 33 DDR IB lines; Altix 450 NAS head node serving the NAS storage subsystem, 16 TB raw.)
Exemplar NAS Storage Subsystem
(Diagram: four IB links from the cluster to the Altix 450 NAS head; IS4000 RAID controller with FC connections to two SATA RAID drive trays.)
Tempo Administrative Commands
cpower control power for cluster node(s)
cimage manage compute blade images
cexec run commands on cluster node(s)
console open system console to cluster node
Tempo Administrative Commands (cont.)
cadmin set/show certain cluster parameters
ckill kill processes on cluster node(s)
cget get file from cluster node(s)
cpush push file to cluster node(s)
clist list names of cluster node(s)
cnum return node number based on name
cname return node position based on name
cpower
cpower examples
cpower r1i*
cpower --off r2i*
cpower --iru --up r1*
cpower --rack --up r4
cpower --rack --noleader --off r4
cpower --halt --system
cpower --off --system
Power up the cluster
Exemplar power up
boot admin through bmc or power button
telnet service1-l2          <-- system controller for NAS
L2> pwr u
ctrl-\  terminate telnet
ctrl-t  go to L2> prompt from console
cpower --reboot --system
use smadmin to start opensm on both fabrics once the cluster is booted
Power down the cluster
Exemplar power down
cpower --halt --system
ssh service1 halt
telnet service1-l2
L2> pwr d
ctrl-\  terminate telnet
ctrl-t  go to L2> prompt from console
halt admin
Creating images
cp /etc/opt/sgi/rpmlists/service-sles10sp1.rpmlist /etc/opt/sgi/rpmlists/new-service-node.rpmlist
mksiimage -A --name new-service-node-image --location /tftpboot/distro/sles-10-x86_64, /tftpboot/oscar/common-rpms,/tftpboot/oscar/sles-10-x86_64 --filename /etc/opt/sgi/rpmlists/new-service-sles10sp1.rpmlist
cimage --add-db new-compute-node-image      <-- if this is a compute node image
post-process-sgi-image /var/lib/systemimager/images/new-service-node-image/ eth1
Adding RPMs to images
cd /var/lib/systemimager/images/
cp /newrpm.rpm new-compute-sles10sp1/tmp
chroot new-compute-sles10sp1 bash
rpm -Uvh /tmp/newrpm.rpm
rm /tmp/newrpm.rpm      <-- clean up /tmp dir
exit                    <-- get out of chroot
Adding RPMs to images (cont.)
cimage --push-rack new-compute-sles10sp1 r\*
cimage --set new-compute-sles10sp1 2.6.16.54-0.2.5-smp "r*i*n*"
cimage --del-db new-compute-sles10sp1
cimage --add-db new-compute-sles10sp1
cimage --list-nodes r1
cimage --del-image mynewimage
cimage command
Show images and compute node boot kernels
cimage --list-images
cimage --list-nodes
Push modified images to a rack
cimage --push-rack new-compute-sles10sp1 r\*
Set compute node kernel to boot
cimage --set new-compute-sles10sp1 kernel_name r1i0n0
Clone a compute image
cimage --clone-image compute-sles10sp1 new-compute-sles10sp1
cimage command (cont.)
Adding/updating kernels in compute images
cimage --del-db new-compute-sles10sp1
cimage --add-db new-compute-sles10sp1
Removing images from Tempo
cimage --del-image new-compute-sles10sp1
cexec
cexec --all hostname
cexec --head hostname
cexec -f /etc/c3svc.conf hostname
cexec rack_1:16 hostname
cexec blades:16 hostname
count booted nodes
cexec -p pwd | grep -c root
console and ipmitool
console runs via a conserver daemon on the leader node, which uses ipmitool to open consoles on nodes
console logs: admin:/net/r1lead/var/log/consoles/
console service0
console r1i0n0
“ctrl-e,c,.” to get out of console
console and ipmitool (cont.)
ipmitool is used to communicate with the BMCs of cluster nodes
monitors power and temperature
connected to aux power, so it powers up when the node is plugged into AC
Setting the IP address of the admin node's BMC
need to start the ipmi daemon to communicate with the local node's BMC
/etc/init.d/ipmi start
ipmitool lan print 1
ipmitool lan set 1 ipaddr x.x.x.x
ipmitool lan set 1 netmask x.x.x.x
ipmitool lan set 1 defgw x.x.x.x
/etc/init.d/ipmi stop
console and ipmitool (cont.)
Accessing a remote BMC
ipmitool -I lanplus -o supermicro -H service0-bmc -U ADMIN -P ADMIN power status
cadmin
power down a node for maintenance
cpower --down r1i0n0
cadmin --set-admin-status --node r1i0n0 offline
power up a node after maintenance
cadmin --set-admin-status --node r1i0n0 online
cpower --boot r1i0n0
set boot order of service nodes
cadmin --set-boot-order --node service0 2
0 = skip node in cpower
InfiniBand commands
ibstat       show current HCA stats
ibstatus     show link rate and state of HCA
State: Initializing    opensm not running
State: Active          opensm running
perfquery    find IB errors and reset counters
perfquery -P 1 -R -a
perfquery -P 2 -R -a
ibdiagnet    scans the fabric and prints any errors
ibcheckwidth, ibcheckerrors
ibnetdiscover    discovers the IB network
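A quick fabric health pass can chain the commands above across the cluster; a minimal sketch, assuming ibstatus and perfquery are installed on every node and using the cexec syntax shown earlier:
# link state and rate on every node; anything not ACTIVE needs a look
cexec --all "ibstatus | grep -E 'state|rate'"
# reset and re-read error counters on both fabrics, then rescan for errors
perfquery -P 1 -R -a
perfquery -P 2 -R -a
ibdiagnet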
opensm
Configure opensm. This is typically only run once, usually during the original Tempo install.
smconfig -f ib0
smconfig -f ib1
Start opensm on each fabric. This must be done manually after a cluster power up.
smadmin -f ib0 -u
smadmin -f ib1 -u
opensm (cont.)
Get the status of opensm on a fabric
smadmin -f ib0 -s
Restart opensm on a fabric
smadmin -f ib0 -r
ivt – Inventory Verification Tool
ivt -M take snapshot
ivt -L list snapshots
ivt -S analyze snapshots
ivt -Q create custom ivt queries
Monitoring
SEL
Critical hardware events are gathered for the leader nodes, compute blades, and CMCs and logged in the following locations:
- /var/log/messages
- /var/log/sel/sel.log
- Embedded Support Partner (ESP)
Ganglia
http://admin/ganglia
Node availability (60 sec heartbeat)
gmetad    daemon that runs on the admin node
gmond     daemon that runs on leader, service, and compute nodes
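To confirm the Ganglia daemons are actually running after a power-up, a minimal check (a sketch, assuming the stock gmetad/gmond init scripts and the cexec groups shown earlier):
/etc/init.d/gmetad status            # gmetad on the admin node
cexec --all "pgrep -l gmond"         # gmond on leader, service, and compute nodes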
Monitoring (cont.)
ESP – Embedded Support Partner
http://admin:5554
Must manually register nodes and complete the customer profile in the web page
username: administrator
password: partner
ESP Administration > Customer Profile
Fill out the required fields
Click Add, Click Commit
ESP User Guide, man pages
Monitoring (cont.)
PCP – Performance Co-Pilot
Collects SDR information
  sensor information
  cluster statistics
pmgcluster, clustervis, pmice
pmchart -h r1i0n0    create or view charts
PCP User Guide, man pages
Software Repositories and Updates
Repository locations on the admin node
/tftpboot/distro/sles-10-x86_64     SLES10
/tftpboot/oscar/common-rpms         SGI Tempo
/tftpboot/oscar/sles-10-x86_64      SGI ProPack
sync-repo-updates
Configure to use with SLES
Register the admin node with Novell
Configure for use with Tempo and ProPack
Create an SGI Supportfolio account
Configure yup
Software Repositories and Updates (cont.)
Use yum to update cluster nodes
Nodes with disks
ssh service0 yum -y update
ssh r1lead yum -y update
yum -y update    (admin node)
Images
yum-image-wrapper /var/lib/systemimager/images/compute-clone update
Installing a package with yum
ssh service0 yum install zlib-devel
Other Tempo related commands
tempo-info-gather collects information about the cluster. It is sometimes requested by SGI support and creates a large output file.
firmware_revs displays firmware revisions for the BIOS, BMCs, CMCs, and IB in the cluster
dbdump dumps Tempo database
Backup/Restore Tempo database
mysqldump --opt oscar > backup-database.sql
mysql oscar < backup-database.sql
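Since the cluster configuration lives in the oscar database, the dump above is easy to automate; a sketch, assuming root's crontab on the admin node and a hypothetical /root/tempo-backups directory:
# weekly dump at 02:00 Sunday, kept with a date stamp (% must be escaped in crontab)
0 2 * * 0  mysqldump --opt oscar > /root/tempo-backups/oscar-$(date +\%Y\%m\%d).sql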
Diagnostics
SGI Diagnostics loaded in /usr/diags/bin
Memory
cexec --all /usr/diags/bin/olcmt PERCENT=98 REPEAT=5
ipmitool -v -I lanplus -o supermicro -H <IPaddr> -U ADMIN -P ADMIN sel list | grep ECC
Stress
/usr/diags/bin/pandora percent=80 -runtime 15
NAS
NAS web interface
https://service1:1178
The NAS is not managed by Tempo
Needs to be power cycled manually
ssh service1 halt
telnet service1-l2
L2> pwr
L2> pwr d
L2> pwr u
ctrl-d  go to system console
ctrl-t  go to L2 prompt (type L2 to stay at the prompt)
Altix 450 User Guide, AppMan User Guide
Intel Compilers
Located in /apps/intel
License manager runs on service0
/etc/init.d/Intel.lmgrd    startup script
ps -ef | grep Intel        check if the license server is running
modulefiles are loaded for the compilers
module avail
module load
module unload
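As a quick sanity check of the compiler environment, a minimal sketch (module names are taken from the PBS examples later in this deck; hello.c is a hypothetical test source):
source /usr/share/modules/init/bash
module load cc/9.1.052 fc/9.1.052
icc -o hello hello.c        # a C compile exercises the license server on service0
ifort -V                    # prints the Fortran compiler version if licensing is healthy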
Launching MPI jobs on Encanto
• SGI MPT % mpirun -v -f mpt.hosts.$$ -np 8 ./hellohostf_mpt
Example: /home/examples/mpi/smpi/run_smpi
• MVAPICH MPI % mpirun -machinefile mpd.hosts.$$ -np 16 ./hellohostf_mv
Example: /home/examples/mpi/mvmpi/run_mvmpi
• Intel MPI % mpdboot -n 2 -f mpd.hosts.$$ -r ssh
% mpiexec -machinefile mpd.hosts.$$ -np 16 ./hellohostf_impi
Example: /home/examples/mpi/impi/run_impi
• Open MPI % mpiexec -machinefile mpd.hosts.$$ \
    -byslot -mca btl_openib_warn_default_gid_prefix 0 \
    -mca oob_tcp_peer_retries 300 \
    -mca btl openib,sm,self \
    -np 16 ./hellohostf_ompi
Example: /home/examples/mpi/ompi/run_ompi
Documentation
SGI Technical Publications: http://techpubs.sgi.com
SGI Supportfolio: http://www.sgi.com/support
Support for Exemplar Systems – Full Care
SGI Full Care Support
• FullCare support provides hardware and software support for Encanto with next-business-day priority response.
• Software support includes the ability to call the support center and ask questions, file bugs, etc. on covered software.
• FullCare support extends the manufacturer's warranty, which covers only the hardware and includes limited technical assistance from the call center. The warranty covers only failed hardware, not software.
• Hardware is defined as all physical components, including the disks, node boards, login and admin servers, network interface cards, and internal cabling.
• Next-business-day response is 8x5: eight hours a day, five days a week during normal business hours, excluding holidays.
Support for Exemplar Systems – Call Logging
• To log a trouble ticket:
– Call 1-800-800-4SGI, or go to support.sgi.com and use Supportfolio.
• Reference the following serial numbers:
– UNM Exemplar System Serial Number Z0000068
– NMSU Exemplar System Serial Number Z0000070
– NMTech Exemplar System Serial Number Z0000071
Support for Exemplar Systems – Call Process
SGI Customer Education for Altix ICE
Introduction to the Linux® Operating System
This course introduces students to the basic command-line tools that the Linux operating system provides.
Linux® System Administration
This course introduces Linux command-line system administration to users who have completed a basic Linux, UNIX, or IRIX class.
Linux® Network Administration
This course provides experienced system administrators with the necessary skills to configure, manage, and troubleshoot the SGI Linux™ open-source operating system in a TCP/IP networked environment.
SGI Customer Education for Altix ICE (cont.)
SGI® Altix® System Administration I (SLES based)
This course provides the experienced Linux® user with the skills and information needed to administer the SGI Altix 3000/4000 family of servers and superclusters.
SGI® Altix® System Administration II (SLES based)
This course provides the experienced Linux® user with the skills and information needed to administer the SGI Altix 3000 family of servers and superclusters.
SGI® Altix® ICE Cluster Administration
The Altix ICE Cluster Administration course provides knowledge and practice in basic cluster administration areas such as IPMI configuration, SGI Tempo cluster software installation and configuration, Torque configuration and job submittal, InfiniBand configuration, and cluster monitoring and troubleshooting using Ganglia and Performance Co-Pilot tools.
• Note classes are 4.5 days long and offered in multiple US locations
Questions?
PBS Pro Job Submission
NMCAC – Encanto 14,336 Core Cluster
Scott Shaw
[email protected]
Overview
• Intro to PBS Shell Scripting
• How to Submit a PBS Job
• Tuning Job Placement Using Resource Definitions
• PBS Commands to Know
### PBS example script
#!/bin/bash
#PBS -l select=16:ncpus=8:mpiprocs=8
#PBS -l walltime=01:00:00
#PBS -N examplejob
#PBS -j oe
#PBS -V

cd $PBS_O_WORKDIR

cat $PBS_NODEFILE > mpd.hosts

source /usr/share/modules/init/bash
module purge
module load cc/9.1.052 fc/9.1.052 openmpi_intel
module list

mpirun -machinefile ./mpd.hosts -np 128 ./hello_world
Intro to PBS Shell Scripting
A PBS shell script can be basic or complex depending on the requirements of the job type. The steps below map directly onto the example script above (see the annotated sketch after this list).
Step 1: Specify a shell type
Step 2: Specify the PBS resource definitions
Step 3: Change to the working directory
Step 4: Load the compiler and MPI implementation
Step 5: Execute the application
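A minimal sketch of those five steps, using only directives and module names from the example script above (hello_world stands in for any MPI application):
#!/bin/bash                                    # Step 1: shell type
#PBS -l select=16:ncpus=8:mpiprocs=8           # Step 2: PBS resource definitions
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR                              # Step 3: change to the working directory
cat $PBS_NODEFILE > mpd.hosts
source /usr/share/modules/init/bash            # Step 4: load compiler and MPI
module load cc/9.1.052 fc/9.1.052 openmpi_intel
mpirun -machinefile ./mpd.hosts -np 128 ./hello_world   # Step 5: execute the application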
To run a 128-core job with 8 MPI processes per blade across 16 blades:
• #PBS -l select=16:ncpus=8:mpiprocs=8
To run a 64-core job with 4 MPI processes per blade across 16 blades:
• #PBS -l select=16:ncpus=8:mpiprocs=4
You *must* specify a PBS walltime; the default is set to 1 minute.
How to Submit a PBS Job
• PBS Pro supports two modes for job submission
– Batch mode
qsub examplejob
or
qsub -l walltime=2:30:00 examplejob
– Interactive (shell) mode, for debugging scripts or applications
qsub -I examplejob
(The walltime=2:30:00 override increases the runtime by 1.5 hours over the script's one-hour default.)
service0:~> qsub -I examplejob
qsub: waiting for job 129301.service0 to start
qsub: job 129301.service0 ready
Start Prologue v2.1 Fri Jun 13 07:34:07 MDT 2008
End Prologue v2.1 Fri Jun 13 07:34:07 MDT 2008
scott_shaw@r14i0n6:~>
Tuning Job Placement - Using PBS Resource Definitions
Important concepts
– Compute blades/nodes: 2 quad-core processors, 16 GB memory, diskless blades
– IRU: houses 16 blades, and each rack has four IRUs
– RACK: houses four IRUs, 64 compute blades, or 512 processor cores
(Diagram: rack layout showing blades, IRUs, and the full rack, with L1 displays.)
### PBS example script showing job placement
#!/bin/bash
#PBS -l select=16:ncpus=8:mpiprocs=8
#PBS -l place=scatter:excl:group=iru
#PBS -l walltime=01:00:00
#PBS -N examplejob
#PBS -j oe
#PBS -V

cd $PBS_O_WORKDIR

rm -f mpd.hosts
cat $PBS_NODEFILE > mpd.hosts

source /usr/share/modules/init/bash
module purge
module load cc/9.1.052 fc/9.1.052 openmpi_intel
module list

mpirun -machinefile ./mpd.hosts -np 128 ./hello_world
Resource types:
IRU – isolate the PBS job within the least number of IRUs
RACK – isolate the PBS job within the least number of racks
Default – without specifying a group= statement, any free blade/IRU/rack will be used
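For reference, the corresponding directives look like this; only the group=iru form appears verbatim in the script above, and the rack form is a sketch following the same PBS Pro place syntax:
#PBS -l place=scatter:excl:group=iru     # confine the job to the fewest IRUs
#PBS -l place=scatter:excl:group=rack    # confine the job to the fewest racks
#PBS -l place=scatter:excl               # default: any free blades are used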
PBS Commands to know
• User Commands:
qsub – the command to submit a PBS queue script
qhold – place a currently queued (Q) or running (R) job on hold (H)
qalter – allow changing of submission parameters
qrls – release a job from the H state
qdel – delete a job from the queue; command-line option "-W force {jobid}"
qrerun – rerun a previously submitted job (must know the jobid)
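A few usage sketches for these commands, using the job ID 129301 from the interactive example earlier (any real job ID from qstat works):
qhold 129301                          # hold a queued or running job (state H)
qalter -l walltime=02:30:00 129301    # change a submission parameter on a queued job
qrls 129301                           # release the hold
qdel 129301                           # delete the job from the queue
qdel -W force 129301                  # force-delete a job that will not exit cleanly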
• User Monitoring Commands:
qstat – the command used to check the PBS queue
tracejob – command used to review details of current/previous jobs
pbsnodes – command used to review offline blades/nodes (-l)
PBS Commands to know
Monitor the PBS queue in a top-like output
watch -n 5 "qstat -a"
Check the PBS job(s) status in the queue
qstat -s {jobid}
Check PBS job(s) to see which blades/nodes are being used
qstat -n {jobid}
Output the job(s) details
qstat -f {jobid}
Output the free blades/nodes not in use
pbsnodes -a | grep -B 3 "state = free" | grep Mom | awk '{print $3}'
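If only a count of free blades is needed, a shorter sketch against the same pbsnodes output:
pbsnodes -a | grep -c "state = free"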
Questions?