ROCKS & The CASCI Cluster
By Rick Bohn
What’s a Cluster?
• Cluster is a widely-used term meaning independent computers combined into a unified system through software and networking.
• At the most fundamental level, when two or more computers are used together to solve a problem, it is considered a cluster.
Beowulf Cluster?
Beowulf Clusters are scalable performance clusters based on commodity hardware, on a private system network, with open source software (Linux) infrastructure.
• The designer can improve performance proportionally with added machines.
• The commodity hardware can be any of a number of mass-market, stand-alone compute nodes as simple as two networked computers each running Linux and sharing a file system or as complex as 1024 nodes with a high-speed, low-latency network.
High Performance or High Throughput
The key questions are granularity and degree of parallelism.
• Have you got one big problem or a bunch of little ones?
• To what extent can the problem be decomposed into sort-of-independent parts (grains) that can all be processed in parallel?
• Granularity
– Fine-grained parallelism: the independent bits are small and need to exchange information and synchronize often.
– Coarse-grained parallelism: the problem can be decomposed into large chunks that can be processed independently.
HPC versus HTC
• Fine-grained problems need a high performance system
– that enables rapid synchronization between the bits that can be processed in parallel
– and runs the bits that are difficult to parallelize as fast as possible
• Coarse-grained problems can use a high throughput system, which maximizes the number of parts processed per minute
• HPC systems use a smaller number of more expensive processors, expensively interconnected, and are highly reliable
• HTC systems use a large number of inexpensive processors, inexpensively interconnected
Other Types of Clusters
1. Highly Available (HA) Clusters
• Generally a small number of nodes
• Redundant components
• Multiple communication paths
2. Visualization Clusters
• Each node drives a display
• OpenGL machines
Cluster Architecture
[Diagram: a frontend node on the public Ethernet, connected to the compute nodes over the private Ethernet network, with an optional application network.]
So What’s a Grid?
• The term Grid computing originated in the early 1990s as a metaphor for making computer power as easy to access as an electric power grid. Today there are many definitions of Grid computing.
• IBM defines Grid Computing as "the ability, using a set of open standards and protocols, to gain access to applications and data, processing power, storage capacity and a vast array of other computing resources over the Internet. A Grid is a type of parallel and distributed system that enables the sharing, selection, and aggregation of resources distributed across 'multiple' administrative domains based on their (resources) availability, capacity, performance, cost and users' quality-of-service requirements"
• Grids can be categorized with a three-stage model of departmental Grids, enterprise Grids and global Grids.
NYSGrid Status
Things to Consider
Clusters are phenomenal price/performance computational engines. However:
• They can be hard to manage without experience
• High-performance I/O is still evolving
• Finding out where something has failed increases at least linearly as cluster size increases
• Not cost-effective if every cluster “burns” a person just for care and feeding
• The programming environment could be vastly improved
• Technology is changing very rapidly; scaling up is becoming commonplace
CASCI Cluster
Center for Advancing the Study of Cyberinfrastructure (CASCI)
Guy Johnson, Director
CASCI Cluster Hardware
Head Node (1)
• IBM xSeries 345
• 1 GB RAM
• 2 × Pentium 4, 2.0 GHz
• 6 × 36 GB hard drives (internal RAID 5)
• 2 Gigabit Ethernet ports
Compute Nodes (47)
• IBM xSeries 330
• 512 MB RAM
• 2 × Pentium 3, 1.4 GHz
• 1 × 36 GB hard drive
• 1 Gigabit Ethernet port
NYSGrid Cluster Hardware
Head Node (1)
• IBM xSeries 330
• 768 MB RAM
• 2 × Pentium 3, 1.4 GHz
• 1 × 36 GB hard drive
• 2 Fast Ethernet ports
Compute Nodes (4)
• IBM xSeries 330
• 512 MB RAM
• 2 × Pentium 3, 1.4 GHz
• 1 × 36 GB hard drive
• 1 Fast Ethernet port
Experimental global grid cluster connected to other universities within New York state.
CASCI Cluster Network
The local network (eth0) is gigabit Ethernet using an Extreme Networks 6808 gigabit switch.
CASCI Cluster Images
The Great Wall of Cluster!
Cluster courtesy of Paul Mezzanini
Located behind CASCI Cluster racks
ROCKS Clustering Software
ROCKS Collaborators
• San Diego Supercomputer Center, UCSD
• Scalable Systems Pte Ltd in Singapore
• High Performance Computing Group, University of Tromso
• The Open Scalable Cluster Environment, Kasetsart University, Thailand
• Flow Physics and Computation Division, Stanford University
• Sun Microsystems
• Advanced Micro Devices
ROCKS Cluster Software
Goal: Make Clusters Easy!
1. Easy to deploy, manage, upgrade and scale.
2. Help deliver the computational power of clusters to a wide range of scientific users.
• Making stable and manageable parallel computing platforms available to a wide range of scientists will aid immensely in improving the state of the art in parallel tools.
Supported Platforms
ROCKS is built on top of RedHat Linux releases (CentOS) and supports all the hardware components that RedHat supports, but only the x86, x86_64 and IA-64 architectures.
Processors
• x86 (ia32, AMD Athlon, etc.)
• x86_64 (AMD Opteron and EM64T)
• IA-64 (Itanium)
Networks
• Ethernet (all flavors that RedHat supports, including Intel Gigabit Ethernet)
• Myrinet (provided by Myricom)
• Infiniband (provided by Voltaire)
Minimum Hardware Requirements
Frontend Node
• Disk capacity: 20 GB
• Memory capacity: 512 MB (i386) or 1 GB (x86_64)
• Ethernet: 2 physical ports (e.g., "eth0" and "eth1")
Compute Node
• Disk capacity: 20 GB
• Memory capacity: 512 MB
• Ethernet: 1 physical port (e.g., "eth0")
ROCKS Distribution
• The ROCKS software is bundled into various packages called “Rolls” and put on CDs.
• Rolls are specially compiled to fit into the ROCKS installation methodology.
• Rolls are classified as either mandatory or optional.
• Rolls cannot be installed after the initial installation.
ROCKS Base Rolls
The minimum requirements to bring up a frontend are the following Rolls:
• Kernel/Boot Roll
• Core Roll (Base, HPC, Web-server), OR the separate Base, HPC & Web-server Rolls
• Service Pack Roll
• OS Roll - Disk 1
• OS Roll - Disk 2
ROCKS Optional Rolls
The optional Rolls are:
– Core Roll
• Area 51 (chkrootkit and tripwire)
• Ganglia (system monitoring software)
• Grid (software for connecting clusters)
• Java (Sun Java SDK and JVM)
• SGE (Sun Grid Engine scheduler)
– Bio (bioinformatics utilities, release 4.2)
– Condor (high throughput computing tools)
– PBS (portable batch scheduling software)
– PVFS2 (parallel virtual file system version 2)
– VIZ (visualization software)
– Voltaire (Infiniband support for Voltaire IB hardware)
ROCKS Software Stack
The Head Node
• Users log in, submit jobs, compile code, etc.
• Uses two Ethernet interfaces
– one public, one private for the compute nodes
• Normally has lots of disk space (system partitions < 14 GB)
• Provides many system services
– NFS, DHCP, DNS, MySQL, HTTP, 411, firewall, etc.
• Holds the cluster configuration
Compute Nodes
• Basic compute workhorse
• Lots of memory (if lucky)
• Minimal storage requirements
• Single Ethernet connection for private LAN
• Disposable
• OS easily re-installed from head node
• Nodes can be heterogeneous
NFS in ROCKS
• User accounts are served over NFS
– Works for small clusters (< 128 nodes)
– Will not work for large clusters (> 1024 nodes)
– NAS tends to work better
• Applications are not served over NFS
– /usr/local does not exist
– All software is installed locally (/opt)
411 Secure Information Service
• Provides NIS-like functionality
• Securely distributes password files, user and group configuration files and the like using Public Key Cryptography to protect file content.
• Uses HTTP to distribute the files
• Scalable, secure and low latency
411 Architecture
1. Client nodes listen on the IP broadcast address for “411 alert” messages from the head node.
2. Nodes then pull the file from the head node via HTTP after some delay to avoid flooding the master with requests.
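The two steps above boil down to a pull-with-jitter pattern, which can be sketched in Python. This is only an illustration, not real 411 code: the file content, handler class, and URL path are made up, and a tiny throwaway HTTP server stands in for the head node.

```python
import http.server
import random
import threading
import time
import urllib.request

def pull_after_jitter(url, max_delay=0.2):
    """After an alert, wait a random delay, then pull the file.
    The jitter spreads requests out so clients don't flood the server."""
    time.sleep(random.uniform(0, max_delay))
    with urllib.request.urlopen(url) as resp:
        return resp.read()

class OneFile(http.server.BaseHTTPRequestHandler):
    """Throwaway stand-in for the head node, serving one file."""
    CONTENT = b"root:x:0:0:root:/root:/bin/bash\n"

    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(self.CONTENT)

    def log_message(self, *args):
        pass  # keep the demo quiet

server = http.server.HTTPServer(("127.0.0.1", 0), OneFile)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
data = pull_after_jitter(f"http://127.0.0.1:{port}/passwd")
server.shutdown()
```

With many clients, the random delays desynchronize the fetches, so the head node sees a trickle of HTTP requests rather than a burst.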
As Simple as 411
To make changes to the 411 system, you simply use “make” with the 411 “Makefile”, similar to NIS.
• To publish 411 changes, on the head node run the command: 411put
• To retrieve 411 changes, on the compute node run the command: 411get
or on the head node: cluster-fork 411get --all
Ganglia Monitoring
• Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids.
• It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization.
• It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency.
• Provides a heartbeat to determine compute node availability.
Cluster Status with Ganglia
Security Tools
• Tripwire runs every day and emails the results.
• Chkrootkit is available and is executed manually.
• Iptables is used as the firewall; only trusted networks are allowed access.
Job Management
• It is not recommended to run jobs directly!
– Can hog the cluster/nodes
– No accountability
• Use the installed job scheduler
– You can submit multiple jobs and have them queued (and go home!)
– Fair share: lets other people use the cluster too
– Accountability
CASCI Cluster users without job management!
Scheduling Systems
• Sun Grid Engine (default scheduler)
– Rapidly becoming the new standard
– Integrated into Rocks by Scalable Systems
– Now the default scheduler for Rocks
– Robust, dynamic and heterogeneous
– Currently using 6.0
• Portable Batch System (Torque) and Maui
– Long-time standard for HPC queuing systems
– Maui provides backfilling for high throughput
– The PBS/Maui system can be fragile and unstable
– Multiple code bases: PBS, OpenPBS, etc.
• Condor: high throughput computing (currently under evaluation)
Sun Grid Engine (SGE)
• SGE is resource management software
– Accepts jobs submitted by users
– Schedules them for execution on appropriate systems, based on resource management policies
– Users can submit hundreds of jobs without worrying about where they will run
– Supports serial as well as parallel jobs
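SGE itself is a large system, but its core dispatch idea described above (queue the jobs, then place each one on a node with enough free slots) can be sketched as a toy in Python. Node names and slot counts below are illustrative, not the real cluster layout.

```python
from collections import deque

def schedule(jobs, nodes):
    """Toy FIFO dispatcher: jobs is a list of (name, slots_needed),
    nodes maps node name -> free slots. Returns (job, node) placements."""
    placements = []
    queue = deque(jobs)
    while queue:
        name, slots = queue.popleft()
        # Pick the first node with enough free slots.
        target = next((n for n, free in nodes.items() if free >= slots), None)
        if target is None:
            break  # job waits; a real scheduler would retry later
        nodes[target] -= slots
        placements.append((name, target))
    return placements

# Two dual-CPU nodes, three queued jobs.
free = {"compute-0-0": 2, "compute-0-1": 2}
placed = schedule([("blast", 1), ("matlab", 2), ("mpi", 1)], free)
```

A real scheduler layers its policies (fair share, deadlines, backfilling) on top of a placement loop like this one.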
SUN Grid Engine Versions
• SGE Standard Edition
– Linux cluster
• SGE Enterprise Edition
– When you want to aggregate a few clusters together and manage them as one resource
– When you want sophisticated policy management
• User/project share
• Deadlines
• User, department, project level
Rocks comes standard with SGE Enterprise 6.0
Cluster Web Site (http://cluster.rit.edu)
Requesting an Account
Accessing the Cluster
Access the cluster via an SSH client
• PuTTY
• SSH Secure Shell
• X-Win32
• F-Secure
To transfer data to the cluster use either scp or sftp.
Windows users can download and use WinSCP (http://winscp.net)
Available Applications
• BLAST (basic local alignment search tool for bio research)
• ENVI / IDL data visualization software
• GCC (C, C++, Fortran programming)
• Mathematica (licensing limitations)
• Matlab (licensing limitations)
• mpiBLAST (parallel version of BLAST)
• MPICH (MPI parallel programming)
Other Alternatives to ROCKS
Clustering Software
• Perceus / Warewulf (www.warewulf-cluster.org)
• openMosix Project (openmosix.sourceforge.net)
• Score Cluster System (www.pcluster.org)
• OSCAR (oscar.openclustergroup.org)
System Imaging / Configuration Software
• System Imager (wiki.systemimager.org)
• Cfengine (www.cfengine.org)
• LCFG (www.lcfg.org)
THANK YOU
A Bad to the Bohn Production