Managing Linux Clusters with Rocks

Tim Carlson - PNNL

[email protected]

Introduction

Cluster Design: the ins and outs of designing compute solutions for scientists

Rocks Cluster Software: what it is and some basic philosophies of Rocks

Midrange computing with Rocks at PNNL: how PNNL uses Rocks to manage 25 clusters ranging from 32 to 1,500 compute cores

I Need a Cluster!

Can you make use of existing resources?

Chinook: 2,310 Barcelona CPUs with DDR InfiniBand

Requires EMSL proposal

Superdome: 256-core Itanium 2 SMP machine

Short proposal required

Department clusters: HPCaNS manages 25 clusters. Does your department have one of them?

Limited amount of PNNL “general purpose” compute cycles

I Really Need a Cluster!

Why? Run bigger models?

Maybe you need a large-memory deskside machine. 72 GB in a deskside is doable (dual Nehalem with 18 x 4 GB DIMMs)

Do you need/want to run parallel code? Again, maybe a deskside machine is appropriate: 8 cores in a single machine

You Need a Cluster

What software do you plan to run?

WRF/MM5 (atmospheric/climate)

May benefit from a low-latency network

Quad-core scaling?

NWChem (molecular chemistry)

Usually requires a low-latency network

Needs an interconnect that is fully supported by ARMCI/GA

Fast local scratch required; fast global scratch is a good idea

Home-grown code: any idea of the profile of your code? Can we have a test case to run on our test cluster?

Processor Choices

Intel: Harpertown or Nehalem

Do you need the Nehalem memory bandwidth?

AMD: Barcelona or Shanghai

Shanghai is a better Barcelona

Disclaimer: this talk was due 4 weeks early. All of the above could have changed in that time

More Hardware Choices

Memory per core: be careful configuring Nehalem (three memory channels per socket, so populate DIMMs in multiples of three for full bandwidth)

Interconnect: GigE, DDR, or QDR InfiniBand

Local disk I/O: do you even use this?

Global file system: at any reasonable scale you probably aren’t using NFS

Lustre/PVFS2/Panasas

Rocks Software Stack

Red Hat based: PNNL is mostly Red Hat, so the environment is familiar

NSF funded since 2000

Several HPCwire awards

Our choice since 2001

Originally based on Red Hat 6.2, now based on RHEL 5.3

Rocks is a Cluster Framework

Customizable: not locked into a vendor solution

Modify default disk partitioning

Use your own custom kernel

Add software via RPMs or “Rolls”

Need to make more changes?

Update an XML file, rebuild the distribution, reinstall all the nodes
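A minimal sketch of that loop on a Rocks 5.x frontend (the paths and the 5.3 profile directory here are assumptions; older releases used /home/install):

    # edit the site profile that extends the compute appliance
    cd /export/rocks/install
    vi site-profiles/5.3/nodes/extend-compute.xml
    # rebuild the distribution the nodes install from;
    # nodes pick up the change on their next reinstall
    rocks create distro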

Rocks is not “system imager” based: all nodes are “installed,” not “imaged”

Rocks Philosophies

Quick to install: it should not take a month (or even more than a day) to install a thousand-node cluster

Nodes are 100% configured: no “after the fact” tweaking

If a node is out of configuration, just reinstall

Don’t spend time on configuration management of nodes: just reinstall
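Forcing a node back to a known-good state is just a reinstall; a minimal sketch, assuming Rocks 5.x command names:

    # flag the node to PXE-boot into the installer on its next boot
    rocks set host boot compute-0-0 action=install
    # then reboot it (or run /boot/kickstart/cluster-kickstart on the node itself)
    ssh compute-0-0 reboot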

What is a Roll?

A Roll is a collection of software packages and configuration information

“Rolls” provide more specific tools:

Commercial compiler Rolls (Intel, Absoft, Portland Group)

Your choice of scheduler (Sun Grid Engine, Torque)

Science specific (Bio Roll)

Many others (Java, Xen, PVFS2, TotalView, etc.)

Users can build their own Rolls – https://wiki.rocksclusters.org/wiki/index.php/Main_Page
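As a rough sketch of how a Roll lands on a running frontend (assuming Rocks 5.x commands; the Torque roll is just an example):

    rocks add roll torque-*.iso    # copy the roll into the local repository
    rocks enable roll torque
    cd /export/rocks/install
    rocks create distro            # fold the roll into the distribution
    rocks run roll torque | bash   # apply the roll's frontend configuration

Compute nodes pick up the new Roll the next time they reinstall.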

Scalable

Not “system imager” based: non-homogeneous hardware makes “system imager” style installation problematic

Nodes install from kickstart files generated from a database

Several registered clusters have over 500 nodes

Avalanche installer removes pressure from any single installation server

Introduced in Rocks 4.1

Torrent based: nodes share packages during installation
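Because each kickstart file is generated on demand from the database, you can inspect exactly what a node will install. A minimal sketch, assuming Rocks 5.x commands and the default compute-0-0 naming:

    # dump the database-generated kickstart profile for one node
    rocks list host profile compute-0-0 | less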

Community and Commercial Support

Active mailing list averaging over 700 posts per month

Annual “Rocks-A-Palooza” meeting for community members

Talks, tutorials, working groups

The Rocks cluster register lists over 1,100 clusters representing more than 720 teraflops of computational power

ClusterCorp sells Rocks+, commercial support based on open-source Rocks

PNNL Midrange Clusters

Started in 2001 with an 8-node VA Linux cluster: dual Pentium III 500 MHz with 10/100 Ethernet

Chose Rocks as the software stack

Built our first “big” cluster that same year: 64 dual Pentium III nodes at 1 GHz

Could rebuild all the nodes with Rocks in under 30 minutes

Parts of this system are still in production

Currently manage 25 clusters

Range in size from 16 to 1,536 cores

InfiniBand is the primary interconnect

Attached storage ranges from 1 to 100 terabytes


HPCaNS Management Philosophy

Create a service center to handle money

Charge customers between $300 and $800/month based on size and complexity

Covers account management, patching, minimal backups (100 GB), compiler licenses, BigBrother monitoring, and general sysadmin

Use 0.75 FTE to manage all the clusters

“Non-standard” needs are charged by time and materials:

Adding new nodes

Rebuilding to a new OS

Software porting or debugging

Complex queue configurations

Support Methods

BigBrother alerts: hooks into Ganglia, checking for

Node outages

Disk usage

Email problems to cluster sysadmins

See next slide after a bad power outage!
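The checks themselves can stay simple, since gmond publishes cluster state as XML on its default port. A minimal sketch of that kind of probe (the expected node count is hypothetical):

    # count hosts reporting to gmond (Ganglia's default XML port is 8649)
    EXPECTED=64    # hypothetical cluster size
    UP=$(echo | nc localhost 8649 | grep -c '<HOST NAME=')
    if [ "$UP" -lt "$EXPECTED" ]; then
        echo "cluster degraded: only $UP of $EXPECTED nodes reporting"
    fi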

Support queue: users are pointed to a central support queue

5 UNIX admins watching the queue for cluster items

Try to teach users to use the support queue

Typical Daily Questions

Can you add application X, Y, Z?

My job doesn’t seem to be running in the queue.

The compiler gives me this strange error!

Do you have space/power/cooling for this new cluster I want to buy?

This code runs on cluster X but doesn’t run on cluster Y. Why is that? Aren’t they the same?

Can I add another 10 TB of disk storage?

The cluster is broken!

Always Room for Improvement

Clusters live in 4 different computer rooms. Can we consolidate?

Never enough user documentation

Standardize on resource managers

Currently have various versions of Torque and SLURM

Should we be upgrading older OSes? Still have RHEL 3 based clusters

Do we need to be doing “shared/grid/cloud” computing?

Why in the world do you have 25 clusters?

Questions, comments, discussion!